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Abstract 



This thesis consists of four chapters. The first two chapters pertain to the design 
of stable quantization methods for analog to digital conversion, while the third and 
fourth chapters concern problems related to compressive sensing. 

In the first chapter, we study the /3-encoder and golden ratio encoder, and show 
that these quantization schemes are robust with respect to uncertainty in the mul- 
tiplication element of their implementation, whereby x — > fix. In particular, we 
use a result from analytic number theory to show that the possibly unknown value 
of j3 can be reconstructed as the unique root of a power series with coefficients in 
{ — 1, 0, 1} obtained from the quantization output of test input x and its negative, —x. 

The focus of our attention in Chapter 2 is Sigma Delta (SA) modulation, a course 
quantization method in analog to digital conversion that is widely used for its sim- 
plicity of implementation and robustness to component imperfections. A persistent 
problem in SA quantization of audio signals is the occurrence of undesirable tones 
arising from periodicities in the bit output; these tones are particularly intolerable 
in the space between successive audio tracks. As one of the contributions of this 
thesis, we show that the standard second order 1-bit SA scheme can be modified 
so as to eliminate such periodicities, and this modification can be achieved without 
sacrificing accuracy or introducing significant complexity. 

The emerging area of compressed sensing guarantees that sufficiently sparse signals 
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can be recovered from significantly fewer measurements than their underlying dimen- 
sion suggests, based on the observation that an N dimensional real- valued vector can 
be reconstructed efficiently to within a factor of its best A;-term approximation by 
taking m = k log N measurements, or inner products. However, the quality of such a 
reconstruction is not assured if the underlying sparsity level of the signal is unknown. 
In Chapter 3, we show that sharp bounds on the errors between a signal and p es- 
timates (corresponding to a different sparsity hypotheses, say) can be achieved by 
reserving only O(logp) measurements of the total m measurements for this purpose. 
The proposed technique is reminiscent of cross validation in statistics and learning 
theory, and theoretical justification in the context of compressed sensing is provided 
by the Johnson Lindenstrauss lemma. 

In Chapter 4, we study the relation between nonconvex minimization problems aris- 
ing in compressed sensing and those known as free- discontinuity problems, which are 
often encountered in the areas of image segmentation and edge detection. We adapt 
recent thresholding techniques in the area of sparse recovery to a class of nonconvex 
functionals that contains both compressed sensing and free-discontinuity-type prob- 
lems. Along the way, we establish a connection between these two types of problems, 
using the notion of gamma convergence, and gain new insights into the minimizing 
solutions of such functionals. 
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"But what did the bird see in the clear stream below? His own image; no longer a 
dark, gray bird, ugly and disagreeable to look at, but a graceful and beautiful swan. ' 

- The Ugly Duckling 
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0.1 Overview 



The first part of the thesis is concerned with the acquisition of real- valued, continuous- 
time signals, the likes of which we unknowingly process every day: speech, music, 
video, and wireless communications are all produced in this form, for example. At 
the same time, it is becoming increasingly efficient to store and manipulate informa- 
tion in digital format. This is partly due to the greater robustness of digital signals 
versus analog signals with respect to noise and fluctuation: any slight variation in an 
analog signal can cause a considerable amount of distortion, but for digital sequences 
can be safely rounded off to the nearest element in the discrete alphabet. The design 
and implementation of fast, stable, and accurate algorithms for conversion between 
the continuous and discrete domain is therefore crucial. 

The process of analog -to -digital conversion usually consists of two parts: sampling 
and quantization. Sampling consists of converting the continuous-time signal of in- 
terest f(t) into a discrete-time signal f(t n ). According to the Shannon - Nyquist 
interpolation formula, this operation is invertible if the analog signal of interest is 
uniformly bounded and absolutely integrable, / G L°°(R) fl and bandlimited, 

meaning that its Fourier transform f(u) = f^° f(t)exp(—iu)t)dt vanishes outside 
a bounded interval [—Q/2,Q/2]. The importance of this result lies in the fact that 
many signals of practical engineering interest are well-approximated by bandlimited 
functions. For example, speech signals can be modelled as bandlimited functions 
whose bandwidth Q is 4 KHz, while audio signals are well modelled as bandlimited 
functions whose bandwidth is 20 KHz [59]. 
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Specifically, fi-bandlimited functions / G L°°(R) fl L 1 (IR) can be recovered from 
sufficiently dense evenly-spaced samples according to the following reconstruction 
formula, which holds for A > 1: 



here, the reconstruction filter g can be any function satisfying \\g\\oo < 1, g(w) = 1 for 
\lu\ < £1/2, and g(u>) = for \u\ > Xfl/2. At critical sampling rate A = 1, or Nyquist 
rate, these restrictions determine the Fourier transform g as the indicator function 
for the interval [—Q/2,Q/2], yielding the sine filter, g(t) = f2sinc(f2t) = sin (QiTt)/nt. 
As sinc(t) is not absolutely summable, sampling at the critical rate A = 1 renders 
the reconstruction (1.1) unstable in the presence of inevitable additive error on the 
samples; if noisy input /„ = /(^r) + e n is instead observed, and |e n | = e, then the 
sign pattern of the noise sequence (e n ) can be chosen adversarily so that the resulting 
series J2 n f n g(t — ^) does not even converge [31]. 

In practice, reconstruction is stabilized by oversampling at a fixed rate correspond- 
ing to A > 1. In this setting, one has the freedom to design g so that its Fourier 
transform g is smooth, in which case g will have fast decay, and only those samples 
/(^r) for which |^ — t\ is sufficiently small will contribute significantly towards the 
reconstruction of f(t). 

Quantization in analog to digital conversion 




(1) 



n 
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The second step in analog to digital conversion is quantization, whereby the real- 
valued sequence (/*) = is encoded in a bitstream (q( n ,m)), with K bits 
(<?(n,m))m=i allocated to each sample From these bits, the original function /(£) 
can be reconstructed in the analog domain, albeit imperfectly, by replacing each 
sample f£ in (1.1) by a function of the if- bit sequence (?( n ,m))m=i- For example, 
this sequence could be taken to be the first K bits in the binary expansion of 
- such pulse code modulation quantization schemes are often employed in practice. 
In Chapter 1, we will study a related quantization scheme, the /3-encoder, whereby 
expansions of in base 1 < (3 < 2 are considered in place of binary expansions 
(where (3 = 2). Note that quantization, unlike sampling, is no longer an invertible 
operation. However, in any reasonable quantization scheme, the accuracy of the ap- 
proximation to / from the quantization sequence (q( n ,m)) will increase as as function 
of B = K\, the number of bits spent per Nyquist interval. That is, oversampling 
allows for more accurate as well as more stable reconstruction, but at the expense 
of additional sampling resources. The balance between accuracy, stability, and sam- 
pling efficiency in signal acquisition will be a recurring theme throughout this thesis. 

In Chapters 1 and 2, we focus on the design of stable quantization schemes, as- 
suming that we may spend many bits per Nyquist interval in acquiring samples (/*). 
This is the case for audio signal processing, where signals of interest have reason- 
able bandwidth f2 < 40 KHz, and sampling at a rate of e.g. 32 bits per Nyquist 
interval corresponds to using fewer than 1 million bits per second. In determining 
the stability of a particular quantization scheme to nonidealities in its implemen- 
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tation, of particular importance is its sensitivity to imperfections in the quantizer 
element Q :R — > {— 1,1} that is necessary for conversion from the analog to digital 
representation. Q is usually taken to be the signum function Q(u) = sgn(w), the 
behavior of which an actual quantizer, having finite precision and subject to thermal 
fluctuations, can only approximate. Actual quantizer elements usually come with a 
pre-assigned tolerance e: for input with magnitude below this tolerance, \u\ < e, the 
output Q e (u) G { — 1, 1} is not reliable. We will say that a quantization scheme is 
robust with respect to quantization error if, for some e > 0, the worst approxima- 
tion error produced by the quantization scheme can be made arbitrarily small by 
allowing a sufficiently large bit budget, even if the quantizer used in its implemen- 
tation is imprecise to within a tolerance e. Such robustness can be achieved if the 
quantization sequence (q( n ,m)) constitutes a redundant representation of the original 
f(t), so that an incorrect bit q'^ mn ^ caused by quantization error can be redeemed 
by an appropriate choice of subsequent bits. Of course, any quantization scheme of 
interest will require additional components in its implementation, such as adders, 
multipliers, integrators, and so forth; robustness with respect to imperfections in all 
these elements must be taken into account in evaluating the reconstruction accuracy 
of a particular scheme. 

In Chapter 1, we investigate robustness properties of the /3-encoder, a quantization 
scheme that was recently designed to be robust with respect to quantization errors, 
while also having asymptotically optimal reconstruction accuracy. We show that the 
/3-encoder is also robust with respect to the other components in its implementation, 
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securing the validity of its reconstruction accuracy. Interestingly, such robustness is 
achieved by enforcing a slight shift in the quantizer element, Q(u) = sgn-u + r; that 
is, we take advantage of the freedom to choose an imperfect quantizer, offered by 
redundancy, in order to ensure additional robustness guarantees. 

In Chapter 2, we focus on a different notion of stability in the design of quantization 
schemes. Sigma Delta (SA) quantization, and in particular one-bit SA modulation, 
is a coarse quantization method, as each sample value f£ in (1.1) is replaced by a 
single bit q n G {—1,1} that depends on all previous f£. Because of their robustness 
properties and ease of implementation, SA schemes are widely popular in practice, 
despite suffering from an entirely different kind of problem: in audio signal process- 
ing, periodic oscillatory patterns in the bit output q n often occur, producing idle tone 
components in response to stretches of zero input = 0, or even small amplitude 
sinusoidal inputs. Such tones are audible to the listener, and are so pronounced in 
low-order SA quantizers that such schemes are often avoided in audio applications 
for more complex higher order SA schemes [51]. As one of the contributions of this 
thesis, we introduce a second-order SA scheme that eliminates such periodicities by 
damping the internal variables of the system once the input f£ is identically zero. 
This approach is in line with the philosophy of Chapter 1: to eliminate the idle tones, 
we exploit the freedom to induce a small amount of leakage to the system without 
sacrificing accuracy, as offered by redundancy of the SA quantization output. 

Beyond Nyquist sampling 
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The second part of the thesis concerns compressed sensing, a fast emerging area of ap- 
plied mathematics that represents a new approach to traditional sampling paradigms. 
At the core of this field is the remarkable fact that a large class of underdetermined 
linear systems y = §x are invertible, and can be inverted efficiently, subject to 
the constraint that x is sufficiently sparse, or has sufficiently small nonzero support 
||rr|| = \{i : Xi 7^ 0|}. In particular, a matrix <£> G ]R mxAr with unit normed columns 
is said to have the {k, 5)-restricted isometry property, or satisfy /c-RIP for short, if 
all singular values of any k column sub matrix of <3> lie in the interval [1 — 5, 1 + 5} for 
a given constant 5 < 1. For 2/c-RIP matrices, a /c-sparse vector x can be recovered 
from its lower-dimensional image y = Qx as the minimal i\ norm solution, 



That is, 2/c-RIP for the matrix <3> furnishes an equivalence between the minimal £o 
norm and minimal t\ norm solution from the affine space {u G M N : $w = y} 
if a /c-sparse solution exists; while finding the former solution represents an NP- 
hard problem in general, the latter solution can be recovered efficiently using linear 
programming methods. Even more can be said: for vectors x satisfying y = $x 
that are approximately but not exactly /c-sparse, the error \\x — x\\ £ n incurred by the 
approximation x = argmin$ u=2; using 2/c-RIP matrix $ is still small, thanks to 
the following stability result: 



x = arg min ||u|| 



(2) 



X — x\\t>N < 



2 




inf \\x — z\\ e N. 



(3) 



7 



In words, the error \\x — x\\ £ n is bounded by a fraction of the best possible approxi- 
mation error between x and the set of /c-sparse vectors in the metric of £^ [15]. 

Constructing A>RIP matrices With high probability, an m x N random matrix 
whose coefficients are drawn as independent and identical realizations of a Gaussian 
or Bernoulli random variable will satisfy k-RIP of optimal order, 

k = 0(m/ log (N/m))). (4) 

While no deterministic constructions of RIP matrices have been found of this order, 
many matrices with a smaller degree of randomness satisfy the restricted isometry 
property to almost optimal order. Most notably, with high probability an m x N 
matrix obtained by selecting m rows at random from the N x N discrete Fourier 
matrix satisfies /c-RIP of order k = 0(m/(log AHog (log iV))) [66]. This particular 
result has an interesting interpretation that connects back to the first two chapters: 
signals that are well-approximated as elements from the class of sparse trigonometric 
polynomials, 

fit) = ^ aul exp (2-niut) , t G [0, 1) , o u G R, ||a|| < d, (5) 

oj=-N 

can be efficiently reconstructed from only m = 0(d log 3 N)) samples; when d is small, 
this represents an exponentially smaller number than the O(N) samples required for 
the Shannon- Whittaker reconstruction formula (1.1) to hold. This is significant in 
applications such as radar, navigation, and satellite communications, where signals 
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of interest often have a sparse frequency representation, while measurements are at 
the same time expensive to implement. 

Cross validation in compressed sensing 

From the reconstruction formula (3) and order relation (4), the quality of the ap- 
proximation x = arg mm$ u=y \\u\\i to an unknown signal x in the affine space $x = y 
depends on the approximability of x is by its k = log ^ m largest coordinates. In 
the literature, a known and fixed value for k is assumed to well-approximate all sig- 
nals in the class of interest, while more realistically, k may be a parameter of the 
unknown input. Natural images, for instance, are generally well-approximated by 
only a few discrete Wavelet basis elements [37]; however, certain heavily textured 
images, such as those containing hair or sand, are not particularly compressible in 
such bases. In statistics, parameter selection and noise estimation are commonly 
achieved through cross validation, whereby the available data is separated into in- 
dependent testing and training sets. In Chapter 3, we show that cross validation 
incorporates naturally into the compressed sensing paradigm, as the random mea- 
surements often used for compressed sensing are the same measurements that provide 
almost-isometric lower dimensional embeddings of generic point sets, as guaranteed 
by the Johnson-Lindenstrauss lemma. More precisely, we take a set of m measure- 
ments of the unknown x, and use m — r of these measurements, $x, in a compressed 
sensing decoding algorithm to return a sequence (xi,X2, •••) of candidate approxima- 
tions to x. The remaining r measurements, ^x, are then used to identify from among 
this sequence a 'best' approximation x = Xj, along with an estimate of the sparsity 
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level of x. The proposed method for error estimation in compressed sensing is ex- 
tremely cheap: approximation errors of up to p distinct approximations (xj)? =1 can 
be estimated with high accuracy at the expense of only 10 log p samples. Whereas 
in Chapters 1 and 2 we exploited redundancy of representation afforded by over- 
sampling to provide stability results for quantization schemes, we now exploit the 
stability of random measurements to validate the assumption that the representation 
at hand is redundant. 

Compressed sensing and free-discontinuity problems 

Chapter 4 explores the connection between minimization problems arising in com- 
pressed sensing and those corresponding to free- discontinuity problems, which de- 
scribe situations where the solution of interest is defined by a function and a lower 
dimensional set consisting of the discontinuities of the function. Such situation arise, 
for instance, in crack detection from fracture mechanics [65] or in certain digital 
image segmentation problems [41]. The best-known example of a free-discontinuity 
problem is that of minimizing the so-called Mumford-Shah functional [22], which is 
defined by 



here, the set f2 is a bounded open subset of R , the constants a, f3 > are fixed, 
g E L°°(Q), and H N denotes the iV-dimensional Hausdorff measure. In the context 
of visual analysis, the function g is a given noisy image that is to be approximated 
by the minimizing function u G W 1,2 {VL \ K); the set K is simultaneously is used in 
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order to segment the image into connected components. 

The Mumford-Shah functional J(u, K) is not easily handled, theoretically or nu- 
merically speaking. This difficulty has given rise to the development of approxima- 
tion methods for the Mumford-Shah functional and its minimizers where sets are no 
longer involved [22]. In the one- dimensional setting, minimizers of J(u,K) can be 
approximated in a strong sense by minimizers of a discrete functional which, refor- 
mulated in terms of the discrete derivative, and generalized to incorporate inverse 
problems, takes the form of a selective least squares problem, 

N 

Minimize J r {u) := \\Tu — f\\gm + min { |-Uj| 2 , r} . (6) 

i=i 

Note that the unknown discrete derivative vector, assumed to be 'small' and smoothly 
varying except on the discontinuity set of the solution, should be well-approximated 
by the minimizer of J r . However, despite successful numerical results [24] observed 
for such selective least squares problems, no rigorous results on the existence of 
minimizers, let alone on the convergence of several proposed algorithms to such min- 
imizers, are currently available in the literature. 

In Chapter 4, we establish a connection between the selective least squares prob- 
lem (6) and the £ minimization problem from compressed sensing, 

Minimize J r (u) := \\Tu - f\\] m + ||u|| (7) 
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using the notion of gamma convergence. Moreover, we derive an iterative thresh- 
olding algorithm for finding solutions to the selective least squares functional J r , 
motivated by the recent application of such algorithms for minimizing the £q regu- 
larized functional (7). Our algorithm is shown to converge to local minimizers u of 
the functional J r that are characterized by certain fixed point conditions, including 
the following gap property: for each j, either \uj\ > \/2r or \uj\ < A^r. We show 
in addition that any global minimizer of J T must satisfy such fixed point conditions, 
giving mathematical justification to the observation that minimizers of J r tend to 
be "cartoon" -like, or segmented into regions of small gradient, separated by edges 
corresponding to large gradient. Moreover, we show that minimizers are restricted 
to regions where the non-convex functional J r is locally strictly convex. Thus, global 
minimizers are always isolated, although not necessarily unique, whereas local mini- 
mizers may constitute a continuum of unstable equilibria. These observations sheds 
light on fundamental properties, virtues, and limitations, of regularization by means 
of the Mumford-Shah functional, and provide a rigorous justification of the numerical 
results appearing in the literature. 



12 



Contributions of this thesis 



Chapter 1. The /3-encoder and golden ratio encoder were recently introduced as 
quantization schemes in analog to digital conversion that are simultaneously robust 
with respect to quantizer imperfections in their analog implementation and efficient 
in their bit-budget use. However, it was not clear at first [34] that these schemes 
were also robust with respect to imprecisions in the multiplier element. Using a result 
in analytic number theory concerning the roots of power series with coefficients in 
{ — 1, 0, 1}, we show that these schemes are indeed robust as such, and the underlying 
value of (3 can be reconstructed as the unique root on [0, 1] of a power series obtained 
from the quantization output of test input (x, —x). 

Chapter 2. Sigma Delta (XA) modulation is a course quantization method in 
analog to digital conversion that is widely used for its simplicity of implementation 
and robustness properties; however, a persistent problem in SA modulation is the 
occurrence of undesirable idle tones arising from oscillatory patterns in the bit output. 
Such oscillations are omnipresent, and automatically arise along stretches of zero 
input, the likes of which are unavoidable in audio signal processing. Responding 
to a question left open in [79], we introduce a family of quiet second-order 2-bit 
asymmetrically damped SA schemes whose quantization output becomes constant 
at the onset of zero input, effectively eliminating the idle tones that arise in this 
context. 

Chapter 3. The emerging area of compressed sensing boasts efficient sensing tech- 
niques based on the phenomenon that an N dimensional real-valued vector can be 
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reconstructed efficiently to within a factor of its best /c-term approximation error by 
taking only m = klogN measurements. However, the quality of such a reconstruc- 
tion is not assured if the underlying sparsity level of the signal is unknown. We show 
that sharp bounds on the errors between a signal and p approximations to the signal 
(corresponding to a different sparsity hypotheses, say) can be achieved by reserving 
only O(logp) measurements of the total m measurements for this purpose. This 
error estimation technique is reminiscent of cross validation in statistics and learning 
theory, and theoretical justification in the context of compressed sensing is provided 
by the Johnson Lindenstrauss lemma. 

Chapter 4. Inspired by recent thresholding techniques in compressed sensing, we 
develop an algorithm for minimizing a discrete version of the Mumford Shah (MS) 
functional arising in image segmentation [21]. Although regularization methods in- 
volving similar nonconvex functionals work well in practice [63], ours is the first 
proven to converge to local minimizers of the discrete MS functional. The proposed 
algorithm can be adapted to a more general class of nonconvex functionals that 
includes the ^-regularized functional from compressed sensing. Finally, we show 
that the £ regularized functional can be viewed as the limit of a sequence of free- 
discontinuity type problems, illuminating an intimate connection between the two 
problems. 
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Chapter 1 



Robust quantization in analog to 
digital conversion 

1.1 Analog to digital conversion: an introduction 

Digital signals are omnipresent; one important reason is that transmission and stor- 
age for digital signals is much more robust with respect to noise and fluctuation than 
for analog signals. Because many signals of interest to us are produced in analog 
form, a transition from analog to digital is therefore necessary. 

Analog-to-digital (A/D) conversion consists of two parts: sampling and quantization. 
The process of sampling, followed by quantization, can be schematically represented 
by 

/(*) (f(tn)) nez ((5(n,m))£=l) nez 
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where q( n>m ) = (?(n,i), Q(n,2), ■■■,Q(n,K)) is a finite sequence of digits from a finite (and 
usually binary) alphabet. It is reasonable to assume that the real- valued analog signal 
of interest, /, is uniformly bounded and absolutely integrable, / G L°°(M) fl L 1 (M), 
and is bandlimited; i.e., its Fourier transform f(uu) = f(t) exp {—iut)dt vanishes 
outside some bounded interval [—Q/2,Q/2]. Under this assumption, the sampling 
step in A/D conversion is an invertible operation, as fit) can be reconstructed from 
the samples (/(^f)) neZ as l° n g as A > 1, via the reconstruction formula 

. . 1 v-^ „/mr\ / mr \ 

/W = 5 E/GnM'-An)- < L1 > 

n 

Above, g can be any function such that \\g\\oo < 1,<7 — 1 for |£| < f2 and <? = 
for |£| > Af2. If <? is smooth, then g will have fast decay and the reconstruction 
(1.1) will be almost entirely local [31]. By scaling and normalizing appropriately, we 
can reduce any bandlimited function in L°°(M) fl L 1 (IR) to an element of the class 
S = {/ : ll/lloo < 1, ||/||i < l,supp / G [— 7r,7r]}, in which case the reconstruction 
formula (1.1) reduces to 

n 

The second step in A/D conversion is quantization, in which sample values f(j) are 
replaced by finite bitstreams (q( n ,m))m=i an< ^ associated function A : {0, 1} K — > R, so 
that the original function can be reconstructed at a later time by the approximation 

/(^^A^ft-^). (1.3) 

n 
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Quantization, unlike sampling, is not in general an invertible operation. That is, if 
E is the operator which maps functions in S to bitstreams (the encoding operator) 
associated with a particular quantization scheme, and D is the decoding operator 
which maps bitstreams back to functions in S, then typically / 7^ D(Ef). In order 
to measure the performance of a particular quantization scheme, one considers the 
distortion of the scheme, given by d(S;E,D) : = supj g5 ||/ — D(Ef)\\, where |.| is 
the norm of interest; e.g., the L°° norm on R. In any reasonable quantization scheme, 
the distortion will decrease as the number of bits per unit interval (the so-called bit 
budget) increases; schemes whose distortion decreases faster as a function of the bit 
budget are generally considered superior quantization schemes. 

If the relation between the distortion d and the bit budget B were the only im- 
portant measure associated with a quantization scheme, then the widely-used pulse 
code modulation (PCM) quantization algorithm would always be preferred in prac- 
tice. Given a function / in S, a TV-bit PCM algorithm simply replaces each sample 
value f(j) with N bits: one bit for its sign, followed by the first TV — 1 bits of the 
binary expansion of )|. One can show that for signals / in S, and for fixed A > 1, 
the distortion d(f, E, D) < C X 2~ N = C A 2~ B/A for an TV-bit PCM, when, for instance, 
||.| is the L°° norm on R. 

Sigma Delta (EA) quantization 

Another popular quantization algorithm, EA modulation, has distortion that decays 
only like an inverse polynomial in the bit budget B. One-bit EA modulation replaces 
each sample in (1.1) by a single bit, b n e { — 1, 1} (In multibit EA schemes, the 
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coefficients b(n) replacing the can assume a larger range of discrete values, and 
thus require several bits). In contrast to PCM, the b n in EA depend not only on 
but on all previous f£, k < n. Consider first-order, one-bit EA, where the b n are 
generated recursively by the scheme, 

bn = Q(u n -l + fn) 

U n = Un-l + /£-&n- (1-4) 

The quantizer Q(x) = sgn(x) just returns the sign of its argument, which can be 
taken to be either 1 or —1 when x — 0. Assuming |/*| < 1 (which is true of 
functions / e S), and initializing |u | < 1, a simple inductive argument guarantees 
that \u n \ < 1 for all n [31]. Using this, along with the observation that B = A for one- 
bit quantization schemes, we arrive at an error estimate ||/ — / s ||oo < 2||^|||ii? _1 
for the first-order EA scheme (1.4). More generally, the K th order analog of the 
recursion (1.4) equates f£ — b n to the K th order (as opposed to first order) difference 
of a bounded sequence (u n ) ne z, and has corresponding decay ||/ — / B ||oo < Ck, 9 B~ k ; 
in particular, the second-order one-bit EA quantization scheme can be recast in the 
form 

b n = Q(F(u n -l,V n -l, fn)) 
U n = Mn-1 + fn -b n 

V n = V n -1+U n . (1.5) 
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The function F can be any map that guarantees the boundedness of the state vari- 
ables (u n ,v n ); for example, the linear function F(u n ,v n ) = u n + ^yv n suffices for an 
appropriate range of 7 [79]. 

Robustness for SA quantization 

Clearly, the 0{2~ B ) decay of PCM will outperform the 0(B~ K ) decay of a K-th or- 
der one-bit SA scheme, for any order K. Yet, SA schemes of as low as second order 
are preferred in practice over high order PCM quantization schemes. This can be 
partly understood by the fact that SA schemes are less sensitive than PCM schemes 
to inevitable errors in the analog components of their implementation. The notion 
of robustness of SA schemes to quantization error, which was first studied in [31], 
is connected to how well SA schemes exploit the redundancy of the representation 
f(t) = \ c n,g(t — 1„) afforded by oversampling A > 1. By redundancy, we refer to 
the fact that the functions (<7n(0) ne z = {g(t — tn)) n< =z form a frame 1 for the Hilbert 
space of fi-bandlimited functions, and their frame redundancy increases with A. One 
source of quantization error incurred in both SA and PCM quantization comes from 
the quantizer function itself: the function Q(u) = sgnw that is used in both the 
SA schemes and in the recursive algorithms used to generate binary expansions in 
PCM cannot be built to have infinite precision. In fact, quantizer elements for A/D 
circuits generally come with a prescribed tolerance v for which the output Q(u) of 
such quantizers should not be trusted once \u\ < v (and of course, quantizers with 

1 Technically, the shifts of g are not a frame for the space of fi-bandlimited functions because 
these functions don't live in this space. But this is only a technicality due to the definition of a 
frame; the function g lives in the larger Hilbert space of f2A-bandlimited functions, or simply L 2 
for that matter, and its shifts satisfy the frame property for the smaller space of fi-bandlimited 
functions. 
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lower tolerance are more expensive). As the binary expansion of almost every real 
number is unique, an incorrect bit assignment, and especially an incorrect initial bit, 
in the truncated binary expansion of a sample f£ will cause an error in the resulting 
approximation that obviates the possibility of 0(2~ B ) decay. £A schemes, on the 
other hand, keep track of prior samples f£,k<n,in such a way that errors caused 
by imprecise quantizers can be corrected later, and the 0(B~ K ) maintained, by ex- 
ploiting the redundant information in the samples [31]. 

This discussion brings us to the question: Is it possible to have the best of both 
worlds^ That is, can one design a quantization scheme that has exponential recon- 
struction guarantees of PCM while also being robust with respect to quantization 
imperfections like EA? This question was answered affirmatively in [33], with the 
introduction of the /3-encoder. 

1.2 On the robustness of beta-encoders and golden 
ratio encoders 

Beta-encoders are similar to PCM in that they replace each sampled function value 
x = fn with a truncated series expansion in a base (3, where 1 < j3 < 2, and with 
binary coefficients. Clearly, if (3 — 2, then this algorithm coincides with PCM. How- 
ever, whereas the binary expansion of almost every real number is unique, for every 
(3 G (1,2), there exist a continuum of different (3 expansions of almost every x in 
(0, 1] (see [69]). It is precisely this redundancy that gives beta-encoders the freedom 
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to correct errors caused by imprecise quantizers shared by EA schemes. Whereas in 
EA, a higher degree of robustness is achieved via finer sampling, beta-encoders are 
made more robust by choosing a smaller value of (3 as the base for expansion. 

Although beta-encoders as discussed in [33] are robust with respect to quantizer 
imperfections, these encoders are not as robust with respect to imprecisions in other 
components of their circuit implementation. Typically, beta-encoders require a mul- 
tiplier in which real numbers are multiplied by p. Like all analog circuit components, 
this multiplier will be imprecise; that is, although a known value (3 may be set in 
the circuit implementation of the encoder, thermal fluctuations and other physical 
limitations will have the effect of changing the true multiplier to an unknown value 
P ^ [Plow, ft high] within an interval of the pre-set value P . The true value ft will vary 
from device to device, and will also change slowly in time within a single device. This 
variability, left unaccounted for, disqualifies the beta-encoder as a viable quantization 
method since the value of (3 must be known with exponential precision in order to re- 
construct a good approximation to the original signal from the recovered bit streams. 

We overcome this potential limitation of the beta-encoder by introducing a method 
for recovering (3 from the encoded bitstreams of a real number x e [—1,1] and its 
negative, —x. Our method incorporates the techniques used in [34], but our analy- 
sis is simplified using a transversality condition, as defined in [70], for power series 
with coefficients in {—1,0,1}. As the value of (3 can fluctuate within an interval 
[Plow, Phigh] over time, our recovery technique can be repeated at regular intervals 
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during quantization (e.g., after the quantization of every 10 samples). 



The golden ratio encoder (GRE) was proposed in [35] as a quantizer that shares 
the same robustness and exponential rate-distortion properties as beta-encoders, 
but that does not require an explicit multiplier in its circuit implementation. GRE 
functions like a beta-encoder in that it produces beta-expansions of real numbers; 
however, in GRE, (3 is fixed at the value of the golden ratio, (3 = <p = 1+ 2 V ^ . The 
relation 2 = + 1 characterizing the golden ratio permits elimination of the mul- 
tiplier from the encoding algorithm. Even though GRE does not require a precise 
multiplier, component imperfections such as integrator leakage in the implemen- 
tation of GRE may still cause the true value of (3 to be slightly larger than 0; in 
practice it is reasonable to assume (3 € [0, 1.10]. Our method for recovering (3 in gen- 
eral beta-encoders can be easily extended to recovering (3 in the golden ratio encoder. 

Our work in this section will be organized as follows: 

1. In sections 1.2.1 and 1.2.2, we review relevant background on beta-encoders 
and golden ratio encoders, respectively. 

2. In section 1.2.3, we introduce a more realistic model of the golden ratio encoder 
that takes into account the effects of integrator leak in the delay elements of 
the circuit. We show that the output of this revised model still correspond to 
truncated beta-expansions of the input, but in an unknown base (3 that differs 
from the pre-set value. 
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3. Section 1.2.4 describes a way to recover the unknown value of f3 up to arbitrary 
precision using the bit streams of a 'test' number x G [—1,1], and —x. The 
recovery scheme reduces to finding the root of a polynomial with coefficients 
in {-1,0,1}. 

4. Section 1.2.5 extends the recovery procedure of the previous section to the 
setting of beta-encoders having leakage in the (single) delay element of their 
implementation. We show that the analysis in this case is completely analogous 
to that of Section 1.2.4. 

1.2.1 The beta-encoder 

In this section, we summarize certain properties of beta-encoders (or /3-encoders) 
with error correction, from the perspective of encoders which produce beta expan- 
sions with coefficients in {—1,1}. For more details on beta-encoders, we refer the 
reader to [33] . 

We start from the observation that given (3 G (1,2], every real number x G [— ^-j-, ^-j-] 
admits a sequence (6j)jgjv, with bj G { — 1,1}, such that 

oo 

•'• (1-6) 

Under the transformation bj = (1.6) is equivalent to the observation that every 
real number y G [0, -^-j-] admits a beta-expansion in base (3 G (1, 2], with coefficients 
bj G {0, 1}. Accordingly, all of the results that follow in this section have straight- 
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forward analogs in terms of {0, l}-beta-expansions; see [33] for more details. 



One way to compute a sequence {bj)j^N that satisfies (1.6) is to run the iteration 

u\ = (3x 

h = Q{ui) 

for j > 1 : u j+1 = p{y.j - bj) 

bj+i = Q(u j+ i) (1.7) 

where the quantizer Q is simply the sign-function, 



-1, M<0 

Q{u) = { (U 
1, u>0. 



For (3 = 2, the expansion (1.6) is unique for almost every x G [—1,1]; however, 
for j3 G (1,2), there exist uncountably many expansions of the type (1.6) for any 
x G [—1,1] (see [69]). Because of this redundancy in representation, beta encoders 
are robust with respect to quantization error, while PCM schemes are not. We now 
explain in more detail what we mean by quantization error. The quantizer Q in 
(1.8) is an idealized quantizer; in practice, one has to deal with quantizers that only 
approximate this ideal behavior. A more realistic model is obtained by replacing Q 
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in (1.8) with a 'flaky' version Q u , for which we know only that 



-1, 



u < —v 



Q» = < 



i, 



U > V 



(1.9) 



— 1 or 1, — i/ < u < v. 



In practice, v is a quantity that is not known exactly, but over the magnitude of 
which we have some control, e.g. \v\ < e for a known e. This value e is called the 
tolerance of the quantizer. We shall call a quantization scheme robust with respect to 
quantization error if, for some e > 0, the worst approximation error produced by the 
quantization scheme can be made arbitrarily small by allowing a sufficiently large 
bit budget, even if the quantizer used in its implementation is imprecise to within 
a tolerance e. According to this definition, the naive { — 1, l}-binary expansion is 
not robust. More specifically, suppose that a flaky quantizer Q v is used in (1.7) to 
compute the base-2 expansion of a number x G [—1,1] which is sufficiently small 
that \lx\ < v. Since 2x is within the flaky zone for Q u , if b\ = Q u (2x) is assigned 
incorrectly; i.e., if b\ differs from the sign of x, then no matter how the remaining bits 
are assigned, the difference between x and the number represented by the computed 
bits will be at least \x\. This is not the case if 1 < (3 < 2, as shown by the following 
whose proof can be found in [33]: 

Theorem 1.2.1. Let e > and x G [—1, 1]. Suppose that in the beta-encoding of x, 
the procedure (1.7) is followed, but the quantizer Q v is used instead of the ideal Q at 
each occurence, with v satisfying v < e. Denote by (bj)j e N the bit sequence produced 
by applying this encoding to the number x. If (3 satisfies 
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1< p < 



2+e 
e+1' 



then 



N 



X — 



—N 



(1.10) 



C = e + 1 . 

For a given tolerance e > 0, running the recursion (1.7) with a quantizer Q u of 
tolerance e and a value of f3 in (1, f^f ) produces bitstreams (bj) corresponding to a 
beta-expansion of the input x in base (3; however, the precise value of (3 must be 
known in order recover x from such a beta-expansion. As mentioned in the introduc- 
tion and detailed in the following section, component imperfections may cause the 
circuit to behave as if a different value of (3 is used, and this value will possibly be 
changing slowly over time within a known range, [(3i ow , (3hi g h\- In [34], a method is 
proposed whereby an exponentially precise approximation 7 to the value of 7 = (3~ l 
at any given time can be encoded and transmitted to the decoder without actually 
physically measuring its value, via the encoded bitstreams of a real number x G [0, 1) 
and 1 — x. This decoding method can be repeated at regular time intervals during 
the quantization procedure, to account for the possible time-variance of 7. That an 
exponentially precise approximation 7 to 7 is sufficient to reconstruct subsequent 
samples f(t n ) with exponential precision is the content of the following theorem, 
which is essentially a restatement of Theorem 5 in [34] . 

Theorem 1.2.2 (Daubechies, Yilmaz). Consider x G [0, 1) and (6j)j e jv G {0, 1}, or 
x G [—1, 1] and (6j)jgjv £ { — 1) !}■ Suppose 7 G (1/2, 1) is snc/i that x = Yl'jLi ^'T 5 '- 
Suppose 7 is snc/i i/iai |7 — 7] < /or some /ixed Ci > 0. T/ien x N := ^2f =1 bff* 
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satisfies 

\x-x N \<C 2 i N (1.11) 

where C 2 is a constant which depends only on 7 and C\ . 

Although the approach proposed in [34] for estimating (3 from bitstreams over- 
comes potential approximation error caused by imprecise multipliers in the circuit 
implementation of the beta-encoder, new robustness issues are nevertheless intro- 
duced. Typically, one cannot ensure that the reference level 1 in 1 — x is known with 
high precision. To circumvent this problem, the authors consider other heuristics 
whereby (3 is recovered from clever averaging of multiple pairs Xj and 1 — Xj. These 
heuristics do not require that the reference level 1 in 1 — x be precise; however, these 
approaches become quite complicated in and of themselves, and any sort of analyti- 
cal analysis of their performance becomes quite difficult. As one of the contributions 
of the present work, we present an approach for recovering (3 that is inspired by the 
approach in [34] but does not require a precise reference level, yet still allows for 
exponentially precise approximations to (3. 

1.2.2 The golden ratio encoder (GRE) 

As shown in the previous section, beta-encoders are robust with respect to imperfect 
quantizer elements, and their approximation error decays exponentially with the 
number of bits N. To attain this exponential precision, (3 must be measured with high 
precision, which is quite complicated in practice. These complications motivated the 
invention of the golden ratio encoder (GRE) of [35], which has the same robustness 
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and rate-distortion properties as beta-encoders, but uses an additional delay element 
in place of precise multiplier in its implementation. That is, if one implements the 
recursion (1.7) with f3 = 4> = ^2^1 then using the relation 2 = + 1, one obtains 
the recursion formula u n+ i = w n _i + u n — (& n _i + (fib n ). If the term b n _\ + <pb n in 
this formula is removed, then the resulting recursion v n +i — v n + v n -\ should look 
familiar; indeed, with initial conditions (i>o,i>i) = (0, 1), this recursion generates the 
Fibonacci numbers v n , and it is well-known that the sequence — > <fi as n — > 00. 
If b n -i + (f>b n is instead replaced by a single bit taking values in { — 1,1}, then we are 
led to the following scheme: 

Uq = X 

u x = 
for n > : b n = Q(u n ,u n+1 ) 

U n+2 = Mn+1 +U n -b n (1-12) 

In this paper, we will consider quantizers Q in (1.12) of the form Q a , where 



-1, u + av < 
Q a (u,v) = { (1.13) 

1, u + av > 
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along with their flaky analogs, 



— 1, u + av < —v 

1, u + av>v (1-14) 
— 1 or 1, — i/ < w + ati < v 



In [35], the authors consider the recursion formula (1.12) implemented with flaky 
{0, l}-quantizers of the form Q% L (u, v) = Q u (u + av — i) + lj /2. For the simplicity 
of presentation, we will consider only the { — 1, l}-quantizers (1.14), but many of our 
results extend straightforwardly to quantizers of the type Q^. 

The following theorem was proved in [35]; it shows that as long as x and Q are 
such that the state sequence u = {u n }^L Q remains bounded, a golden ratio encoder 
(corresponding to the recursion (1.12)) will produce a bitstream (bj) corresponding 
to a beta-expansion of x in base (3 = 0, just as does the beta-encoder from which 
the GRE was derived. 

Theorem 1.2.3 (Daubechies, Giintiirk, Wang, Yilmaz). Consider the recursion 
(1.12). Suppose the 1-bit quantizer Q which outputs bits (bj) in { — 1, 1} is of the type 
Q u a such that the state sequence u = {-u n }^L with u = x and U\ = is bounded. 
Then 

-5>n<T"| <r N+1 (1-15) 



N 

\.r 

n=0 



Here is the golden ratio. 
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Proof. Note that 

N 

?1=0 



The second to last equality uses the relation 1 + <fi — <fi 2 = 0, and the last equality 
is obtained by setting u = x and u± — 0. Since the state sequence u = {M n }™ is 
bounded, it follows that x = Y^=o bn4>~ n ■ Thus, 

N oo 

\x-j2b n r n \ = i h ^~ n \ 

n=0 n =N+l 
0-(iV+l) 

- 1 -(f)" 1 

= <P~ N+l . 



□ 

Although the implementation of GRE requires 2 more adders and one more delay 
element than the implementation of the beta-encoder, the multiplier element a in 



N 



n=0 

N 



U n +1 - M n+2 )0^ 
N+l 



N+2 



n=0 



n=l 



n=2 



N 

■1 , \ ' -/; 



U + (1 + 0- 

n=2 

+U 1 ((j)- 1 + 1) + (j)~ N (u N+1 (l - 0) - U N+2 ^j 

u + _Ar ^Ar + i(l -(f)) - U N+ 2^j 

X + 0"^ (u N+1 (l -(f))- U N+2 ) ■ 
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GRE does not have to be precise (see section 6), whereas imprecisions in the mul- 
tiplier element (3 of the beta-encoder result in beta-expansions of the input x in a 
different base (5' . 

1.2.3 GRE: A revised scheme incorporating integrator leak 

In modeling the golden ratio encoder by the system (1.12), we are assuming that 
the delay elements used in its circuit implementation are ideal. A more realistic 
model would take into account the effect of integrator leak, which is inevitable in 
any practical circuit implementation (see [48] for more details). After one clock 
time, the stored input in the first delay is reduced to Ai times its original value, 
while the stored input in the second delay is replaced by A2 times its original value. 
In virtually all circuits of interest, no more than 10 percent of the stored input is 
leaked at each timestep; that is, we can safely assume that Ai and A2 are param- 
eters in the interval [.9,1]. The precise values of these parameters may change in 
time; however, as virtually all practical A/D converters produce over 1000 bits per 
second (and some can produce over 1 billion bits per second), we may safely assume 
that Ai and A 2 are constant throughout the quantization of at least every 10 samples. 

Fixing an input value x G [—1,1], we arrive at the following revised description 
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of the golden ratio encoder (revised GRE): 



u = 

Ui — X 

for n > : b n = Q(\i\ 2 u n , Xiu n+1 ) 

u n +2 = AiA 2 m„ + Xiu n+ i - b n (1-16) 

Obviously, Ai = A 2 = 1 corresponds to the original model (1.12). It is reason- 
able to assume in practice that (Ai,A 2 ) G M := [.95, l] 2 , and in virtually all cases 
(Ai,A 2 ) EV :=[.9,1] 2 . 

We will show that the revised scheme (1.16) still produces beta-expansions of the 
input x, but in a slightly different base 7 = f3~ l = ~^>x~^~^ > which increases 
away from as the parameters Ai and A 2 decrease. Key in the proof of Theorem 
1.15 was the use of the relation 2 — <fi — 1 = to reduce Yln=o( u n + u n+i — u n+ 2)<fi~ n 
to the sum of the input x, and a remainder term that becomes arbitrarily small with 
increasing N. Accordingly, the relation 1 — A17 — A X A 2 7 2 = gives J2n=o(^^ u ^ + 
Xiu n+ i — u n+ 2)l n+1 = x + R(N), where R(N) goes to as iV goes to infinity. 

Theorem 1.2.4. Suppose the 1-bit quantizer Q in (1.16) of type (1.14) is such that 
the state sequence u = {u n }^L Q with uq = and u\ — x is bounded. Consider 

2A1A2 

k-Elo^7 n+1 |<C 77 ^ 



7 = =^gp^ , T/.en 



where = t^— . 

y 1— 7 
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Proof. 



N 



N 
n=0 



,n+l 



n 



+i - u n+2 )i 



,n+l 



n=0 



= (\ 1 \ 2l 2 + \ ll -l)Y / l n U 



n=2 

+AiA 2 m 7 + ui(Ai7 + AiA 2 7 2 ) 



= X + 7 iY (M A r(Ai -7 X ) ~U N+1 




The last equality is obtained by setting tt = and = x. Since the u n are bounded, 
it follows as in the proof of (1.12) that 



Theorem 1.2.4 implies that if 7 is known, then the revised GRE scheme (1.16) 
still gives exponential approximations to the input signal x, provided that the u n are 
indeed bounded. The following theorem gives an explicit range for the parameters 
{y,ot) which results in bounded sequences u n when the input x G [—1,1], indepen- 
dent of the values of the leakage parameters (Ai,A 2 ) in the set V = [.9, l] 2 . This 
parameter range is only slightly more restrictive than that derived in [35] for the 
ideal GRE scheme (1.12); that is, the admissable parameter range for a and v is 
essentially robust with respect to leakage in the delay elements of the GRE circuit 




□ 



33 



implementation. 

Theorem 1.2.5. Let x G [-1,1], and (Ai,A 2 ) G [.9,1] 2 . Suppose that the GRE 
scheme (1.16) is followed, and the quantizer Q^(u,v) is used, with v possibly varying 
at each occurrence, but always satisfying v < e for some fixed tolerance e < .337. If 
the parameter a takes values in the interval [1.198 + 1.479e, 2.053 — 1.058e] 7 then the 
resulting state sequence (uj)j e N is bounded. 

We leave the proof of Theorem 1.2.5 to the appendix. This result in some sense 
parallels Theorem 1.2.1 in that an admissable range for the "multiplication" param- 
eter (a in GRE, (3 in beta-encoders) is specified for a given quantizer tolerance e; 
however, we stress that the specific value of (3 is needed in order to recover the input 
from the bitstream (bj) in beta-encoders. In contrast, a GRE encoder can be built 
with a multiplier a set at any value within the range [1.198 + 1.479e, 2.053 — 1.058e], 
and as long as this multiplier element has enough precision that the true value of 
a will not stray from this interval, then the resulting bitstreams (bj) will always 
represent a series expansion of the input x in base 7 of Theorem 1.2.4, which does 
not depend on a. The base 7 does however depend on the leakage parameters Ai 
and A2, which are not known a priori to the decoder and can also vary in time from 
input to input: as discussed earlier, the only information available to the decoder a 
priori is that (Ai, A 2 ) G [.9, l] 2 in virtually all cases of interest, and (Ai, A 2 ) G [.95, l] 2 , 
or 7 G [0 _1 , f§0 -1 ] ~ [-618, .6505] in most cases of interest. In the next section, we 
show that the upper bound of .6505 is sufficiently small that the value of 7 can be 
recovered with exponential precision from the encoded bitstreams of a pair of real 
numbers (x, —x). In this sense, GRE encoders are robust with respect to leakage in 
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the delay elements, imprecisions in the multiplier a, and quantization error. 

1.2.4 Determining 7 = 1/(3 up to exponential precision 

Approximating 7 using an encoded bitstream for x = 

Recall that by Theorem 1.2.2, exponentially precise approximations 7 to the root 7 
in Theorem 1.2.4 are sufficient in order to reconstruct subsequent input x n = f(t n ) G 
[—1,1] whose bit streams are expansions in root 7 with exponential precision. In this 
section, we present a method to approximate 7 with such precision using the only 
information at our disposal at the decoding end of the quantization scheme: the 
encoded bitstreams of real numbers x G [—1, 1]. More precisely, we will be able to 
recover the value 7 = using only a single bitstream corresponding to a beta- 
expansion of the number 0. It is easy to adapt this method to slow variations of 7 
in time, as one can repeat the following procedure at regular time intervals during 
quantization, and update the value of 7 accordingly. 

The analysis that follows will rely on the following theorem by Peres and Solomyak 
[61]: 

Theorem 1.2.6 (Peres-Solomyak). [S-transversality] Consider the intervals I p = 
[0, p]. If p < .6491..., then for any g of the form 

00 

g(x) = l + J2 b i xj > h i G {-!> °> ( L17 ) 
and any x G I p , there exists a 5 P > such that if g(x) < 5 p then g\x) < —5 p . 
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Furthermore, as p increases to .6491..., 5 p decreases to 0. 

Theorem 1.2.6 has the following straightforward corollary: 

Corollary 1.2.7. If g{x) is a polynomial or power series belonging to the class B 
given by 

oo 

B = {±l + J2 b J xj > -bj E {-1,0,1}} (1.18) 

3=1 

then g can have no more than one root on the interval (0, .6491]. Furthermore, if 
such a root exists, then this root must be simple. 

In [61], Peres and Solomyak used Theorem 1.2.6 to show that the distribution 
v\ of the random series J^±A n is absolutely continuous for a.e. A G (1/2,1). The 
estimates in Theorem 1.2.6 are obtained by computing the smallest double zero of a 
larger class of power series B := {1 + Y^=i b n x n '■ b n G [—1,1]}. The specific upper 
bound p = .6491... for which 5-transversality holds on the interval I p is the tightest 
bound that can be reached using their method of proof, but the true upper bound 
cannot be much larger; in [68], a power series g(t) belonging to the class B in (1.18) 
is constructed which has a double zero at t pa .68 (i.e., g(t ) = and g'(t ) = 0), 
and having a double root obviously contradicts <5-transversality. 

We now show how to use Theorem 1.2.6 to recover 7 from a bitstream (bj)jL 1 pro- 
duced from the GRE scheme (1.16) corresponding to input x — 0. We assume that 
the parameters (a, v) of the quantizer Q v a used in this implementation are within 
the range provided by Theorem 1.2.5. Note that such a bitstream (bj)fL l is not 
unique if v > 0, or if more than one pair (Ai,A 2 ) correspond to the same value of 
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7. Nevertheless, Theorem 1.2.4 implies that Y^jLi ^fi* = 0, so that 7 is a root of 
the power series F(t) — b± + J2JLi bj+it j . Suppose we know that /3 > 1.54056, or 
that 7 < .6491. Since F(t) belongs to the class B of Corollary (1.2.7), 7 must nec- 
essarily be the smallest positive root of F(t). In reality one does not have access 
to the entire bitsream {bj)^ =1) but only a finite sequence (bj)jL v It is natural to ask 
whether we can approximate the first root of F(t) by the first root of the polynomials 
Pn(t) — bi + YTj=i bj+iV ■ Since the P n still belong to the class B of Corollary (1.2.7), 
|P n (t)| has at most one zero on the interval .6491] m [.618, .6491]. The following 
theorem shows that, if it is known a priori that 7 < .6491 — e for some e > 0, then for 
n sufficiently large, |P n (0l * s guaranteed to have a root j n in [0, .6491], and (7 — 7„| 
decreases exponentially as n increases. 

Theorem 1.2.8. Suppose that for some e > 0, it is known that 7 < jhigh — -6491 — e. 
Let 5 > be such that 5-transversality holds on the interval [0,7^^]. Let N be the 
smallest integer such that j N+1 < (1 — j)eS. Then for n> N, 

(a) P n has a unique root 7„ in [0, .6491] 

(b) I7 - 7„| < Ci7 n ; where d = ^z^y- 

Proof. Without loss of generality, we assume bi — 1, and divide the proof into 2 
cases: (1) P n {l) < 0, and (2) P n {j) > 0. The proof is the same for b\ = —1, except 
that the cases are reversed. 

Case (1): Suppose P n {l) < 0. In this case, many of the restrictions in the theorem 
are not necessary; in fact, the theorem holds here for all n > and e > 0. P n (t) has 
opposite signs at t — and t — 7, so P n must have at least one root 7„ in between. 
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Moreover, this root is unique by Theorem 1.2.6. To prove part (b) of the theorem, 
observe that Theorem 1.2.6 implies that if P n (to) < 5 at some t £ (0, .6491), then 
P' n {t) < -5 in the interval [t ,.6491). In particular, P n {t) < 5 and P' n {t) < -5 for 
t e [ 7n , 7 ]. By the Mean Value Theorem, |P n ( 7 )| = |P n ( 7 ) -P n ( 7 n)| = |i*(0ll7-7n| 
for some £ G [ 7n ,7], so that 



7-7n 



< 

< 
< 





^n(7)l 


mf ee[ 7 n,7] 


^(01 



1^.(7) I 



The inequality |P n (7)| < 77^7 follows from Theorem 1.2.4. 

Case (2): If P n ( 7 ) > 0, then Theorem 1.2.6 implies that the first positive root of 
P n , if it exists, must be greater than 7. Whereas in Case (1), P n was guaranteed 
to have a root 7n < 7 for all n > 0, in this case P n might not even have a root in 
(0, 1); for example, the first positive root of the polynomial P±{t) = 1 — t — t 2 + t 3 
occurs at t — 1. However, if n is sufficiently large, P n will have a root in [7, .6491]. 
Precisely, let n > N, where N is the smallest integer such that >y N+1 < (1 — 7 )e<5. 
Then P n ( 7 ) < 737- < 77^7 < e5 < <5. Since 7 + e < .6491, Theorem 1.2.6 gives that 
Pn(t) < S for t G [7, 7 + e], and 



Pn(7 + e) < ^n(7) + e sup i*(t) 

te[7,7+«] 



0. 
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We have shown that P n {l) > and P„(7 + e) < 0, so P n is guaranteed to have a root 
7„ in the interval [7,7 + e], which is in fact the unique root of P n in [0, .6491]. 
Furthermore, the discrepency between 7„ and 7 becomes exponentially small as 
n > N increases, as 

I in J\ inf^r.. , IP' mi — 



inf ee[7,7n] - (1-7)5 • 

□ 

Remark 1.2.9. In [61] it is shown that S-transversality holds on [0, .63] with 5 = 
8.63 = -07. This interval corresponds to e = .6491 — .63 = .0191 in Theorem 1.2.8. If 
7 is known a priori to be less than .63, then < ~ 1-7027, and N of Theorem 
(1.2.8) satisfies 

jy log e<5-log 1.7027 jg 
— log .63 — 

Of course, even if we know P n (t), n will be too large to solve for 7„ analyti- 
cally. However, if we can approximate 7„ by % with a precision of 0(7"), then 
1 7 _ 7n| < 1 7 — 7n| + |7n — 7n| will also be of order 7™, so that the estimates % are 
still exponentially accurate approximations to the input signal. 

Indeed, any % e .6491 — e] satisfying |P n (7n)| < 0~ n provides such an exponen- 
tial approximation to 7„. Since n > N, it follows that 0~ n < 7™ < 7^ < < - 1 ~ 7 - >£<5 < 5, 
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and so P' n {t) < —5 on the interval between 7„ and 7„. Thus, 

|7-7n| < 17 ~7n| + |7n ~ 7n| 

|^n(7n)| 



< C l7 " + 7 



in f^e[7n,7n]o^e[7u,7n] 1^(01 



5 



< 5 

= C 27 n , (1.19) 



where C 2 = C 1 + I 



5(1-7) • 

Approximating 7 with beta-expansions of (— x, x) 

The method for approximating the value of 7 in the previous section requires a 
bitstream 6 corresponding to running the GRE recursion (1.16) with specific input 
x — 0. This assumes that the reference value can be measured with high precision, 
which is an impractical constraint. We can try to adapt the argument using bit- 
streams of an encoded pair (x, —x) as follows. Let b and c be bitstreams correspond- 
ing to x and —x, respectively. Define dj = bj + Cj. Put k = min{j G N\dj + i 7^ 0}, 
and consider the sequence d— (dj)jLi defined by dj = \dj + k- Since dj G {—2,0,2}, 
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it follows that dj G { — 1, 0, 1}, and by Theorem 1.2.4, we have that 

oo 1 oo 

n=l n=l 

oo oo 

n=l n=l 

= (1.20) 



so that 

N 

\YsdnY l \<C^ N , (1.21) 

n=l 

where the constant C-y = t^ - . 

i 1— 7 

Equation (1.21), along with the fact that the polynomials P/v(i) : = ^i+X^Li dj+iP 
are of the form (1.18), allows us to apply Theorem 1.2.8 to conclude that for iV suf- 
ficiently large, the first positive root 7a? of Pn becomes exponentially close to 7. 
However, note that the encoding of (x, —x) is not equivalent to the encoding of 0. 
The value of k — min{j e N\dj +1 ^ 0} used to define the sequence d can be arbi- 
trarily large. In fact, if an ideal quantizer Q° a is used, then the bitstreams b and c 
are uniquely defined by b = — c, so that d = 0. Thus, this method for recovering 7 
actually requires the use of a flaky quantizer Q v a . To this end, one could intentially 
implement GRE with a quantizer which toggles close to, but not exactly at zero. 
One could alternatively send not only a single pair of bitstreams {b, c), but multiple 
pairs of bitstreams (b l ,c l ) corresponding to several pairs (xi,—xi), to increase the 
chance of having a pair that has b LJ + c LJ 7^ for relatively small j. 
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Figure 1.2 plots several instances of P s ,Piq, and P 32 , corresponding to 7 = .64375. 
The quantizer Q v a {u,v) is used, with a = 2 and v = .3. These values of a and v 
generate bounded sequences (u n ) for all (Ai, A 2 ) G [.9, l] 2 by Theorem 1.2.5. Figure 
1.3 plots several instances of the same polynomials, but with root 7 = .75. 

As shown in Figure 1.1, numerical evidence suggests that 10 iterations of New- 
ton's method starting from xq = (\T X ~ .618 will compute an approximation to 7^, 
the first root of P N , with the desired exponential precision. The figure plots 7 versus 
the error I7 — 7jv|, where 7^ is the approximation to 7^ obtained via a 10- step 
Newton Method, starting from xq = .618. More precisely, for each N, we ran 100 
different trials, with x and 7 picked randomly from the intervals [—1, 1] and [.618, .7] 
respectively, and independently for each trial. The worst case approximation error 
of the 100 trials is plotted in each case. The quantizer used is Q^(u,v) with v = .3, 
and a picked randomly in the interval [1.7,2], independently for each trial. Again, 
Theorem 1.2.5 shows that these values of a and v generate bounded sequences (u n ) 
for all (Ai,A 2 ) G [.9, l] 2 . 

1.2.5 The beta-encoder revisited 

Even though our analysis of the previous section was motivated by leaky GRE en- 
coders, it can be applied to general beta-encoders to recover the value of (3 at any 
time during quantization. From the last section, we have: 

Theorem 1.2.10. Let F(t) = 'Y^L^bjP be a power series belonging to the class 
B = {±l + J2^L 1 bjX J , bj e {-1,0,1}}. Suppose that F has a root at 7 G [yiow,lhigh\, 
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where jhigh = -6491 — e for some e > 0. Let 5 > be such that 5-transversality holds 
on the interval [0,^high\- Let N be the smallest integer such that j N+1 < e5(l — 7). 
Then for n>N, 

(a) The polynomials P n (t) = X}j=oM' ? have a unique root j n in [0, .6491] 

(b) Any 7 G [^i ow , Ihigh] which satisfies \P n {l)\ < (liow) n also satisfies (7 - 7] < 
C\^\ n , where C = 

This theorem applies to beta encoders, corresponding to implementing the re- 
cursion (1.7) with flaky quantizer Q v defined by (1.9), and with 7 = (3~ l known a 
priori to be contained in an interval [jiow, Ihigh}- If /3 > 1.54059, or jhigh < .6491, 
then we can recover 7 = from either a bitstream corresponding to 0, or a pair 
of bitstreams (x, —x), using Theorem 1.2.10. Of course we should not consider only 
the scheme (1.7), but rather a revised scheme which accounts for integrator leak 
on the (single) integrator used in the beta-encoder implementation. The revised 
beta-encoding scheme, with slightly different initial conditions, becomes 

Ui — x 

h = Q(u!) 

for j > 1 : Uj + i = \(3(uj — bj) 

b j+1 = Q(u j+1 ) (1.22) 

where A is an unknown parameter in [.9, 1]. As long as $ — \j3 > 1, we still have 
that I a; — X^Io^+i7 l | < Cpfi~ N where (Tg = j—^] furthermore, we can use Theorem 
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1.2.10 to recover (3 = A/3 in (1.22), although the specific values of A and /3 cannot be 
distinguished, just as the specific values of Ai and A2 in the expression for 7 could 
not be distinguished in GRE. 




5 10 15 20 25 30 5 10 15 " 20 25 30 

N N 



(a) (b) 

Figure 1.1: (a) iV versus I7 — 7jv|, where is obtained via a 10 step Newton Method 
approximation to the first root of the polynomial Pjv starting from x = -618. For 
each N, (7 — 7jv| is the worst-case error among 100 experiments corresponding to 
different ( and different 7, chosen randomly from [.618, .7]. (b) N versus 

\x — xn\, where xjy is reconstructed from j N using the first N bits. The quantizer 
used in this experiment is Q v a with v — .3 and a chosen randomly in the interval 
[1.7,2]. 



Remarks 

Figure 1.1 suggests that the first positive roots of the P n serve as exponentially 
precise approximations to 7 for values of 7 greater than .6491; Figure 1.3 suggests 
that the first positive root of P n will approximate values of 7 up to 7 = .75. Further- 
more, these figures suggest that the constants C\ and C 2 of (1.19) in the exponential 
convergence of these roots to 7 can be made much sharper, even for larger values 
of 7. This should not be surprising, considering that nowhere in the analysis of the 
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previous section did we exploit the specific structure of beta-expansions obtained 
via the particular recursions (1.22) and (1.12), such as the fact that such sequences 
(b n ) cannot contain infinite strings of successive l's or -l's. It is precisely power 
series with such infinite strings that are the 'extremal cases' which force the bound 
of .6491 in Theorem 1.2.6. It is difficult to provide more refined estimates for the 
constants C\ and C 2 of (1.19) in general, but in the idealized setting where beta- 
expansions of are available via the ideal GRE scheme (1.12) without leakage, or via 
the beta-encoding (1.12) with f3 — </>, the beta-expansions of have a very special 
structure: 

Proposition 1.2.11. Consider the ideal GRE recursion (1.12) with input u — U\ — 
and Q = Q u a , or the beta-encoder recursion (1.7) with (5 = <fi, uq = 0, and Q = Q u . 
As long as a > v in (1.12), or v < 1 in (1.7), then for each j G {0, 1, ...}, b 3 j is equal 
to -1 or +1, and b 3j+1 = b 3j+2 = —b 3j . 

The proof is straightforward, and we omit the details. 

Proposition 1.2.11 can be used to prove directly that 7 must be the first positive 
root of the polynomials P/v(t) = 61 + Y^=i bj+it j , when 7 = 0" 1 in either (1.12) or 
(1.7). Indeed, P N can be factored as follows: 

N 

PnW = ^{hj+i - hj+it - b 3j+ it 2 )t 3j 
3=0 

= (1 - 1 - 1 2 ) 

i=i 

= (1 -t-t 2 )R N (t) (1.23) 
45 



where Rjy is a polynomial with random coefficients of the form X^=o '• 

-P/v(i) = (1 — t — t 2 )R N (t) clearly has a root at t — and this root must be 
the only root of P/v on [O,0 -1 ), since on this interval Rn is bounded away from 
by \R N (t)\ > 1 - > 1 - ~ -691. We can also obtain a lower bound on 

mOT 1 )! = -(1 + 20- 1 )i? JV (0- 1 ) by IP^^ 1 )! > |1 + 20- 1 !!! - > 1.545. 

Note that this bound holds also for the infinite sum F(t) — b\ + Y^jLibj+iP , and 
that the bound of |P'(7)I > 1-545 is much sharper than the bound on |P'(7)| given 
by Theorem 1.2.6; e.g., 5 63 = .07, and <5 649 i = .00008 (see [61]). Similar bounds on 
the derivatives \P'n{i)\ and 1^(7)1 corresponding to beta-expansions of in a base 
7 close to 0" 1 should hold, leading to sharper estimates on the constants C 1 and C 2 
of (1.19) in the case where beta-expansions of are available. 

1.2.6 Closing remarks 

We have shown that golden ratio encoders are robust with respect to leakage in the 
delay elements of their circuit implementation. Although such leakage may change 
the base 7 in the reconstruction formula \y — ^2f =1 bj^\ ~ 0(j N ) , we have shown 
that exponentially precise approximations 7 to 7 can be obtained from the bitstreams 
of a pair (x, —x), and such approximations 7 are sufficient to reconstruct subsequent 
input y by \y - Y% =1 bjl j \ ~ 0(7^). 

Our method can be extended to recover the base (5 in general beta-encoders, as 
long as (5 is known a priori to be sufficiently large; e.g. f3 > 1.54. This method 
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is similar to the method proposed in [33] for recovering /3 in beta-encoders when 
{0, l}-quantizers are used, except that our method does not require a fixed reference 
level, which is difficult to measure with high precision in practice. 




(a) (b) (c) 

Figure 1.2: The graphs of (a) P$, (b) Pxq, and (c) P32 for several pairs (xj, —Xj) 
corresponding to 7 = .64575. These polynomials can have several roots on the unit 
interval, but the first of these roots must become exponentially close to 7 as N 
increases. The quantizer used is Q u a with parameter values v = .3, and a = 2. 

1.2.7 Appendix: proof of Theorem 1.2.5 

In this section we prove Theorem 1.2.5, which provides a range within which the 
"flakiness" parameter v and the "amplifier" parameter a in the quantizer Q^(u,v) 
can vary from iteration to iteration, without changing the fact that the sequences 
(u n ) produced by the scheme (1.16) will be bounded. The derived range for v and a 
is independent of the specific values of the parameters (Aj., A2) G [-9, l] 2 in (1.16). 

The techniques of this section are borrowed in large part from those used in [35] 
to prove a similar result for the ideal GRE scheme (1.12); i.e., taking Ai = A2 = 1 
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(a) (b) (c) 

Figure 1.3: The graphs of (a) P$, (b) Pie, and (c) P32 for several pairs (xj, —Xj) 
corresponding to 7 = .75. These graphs suggest that the first positive root of these 
polynomials becomes exponentially close to 7 as N increases, although we do not 
have a proof of this result for values of 7 greater than approximately .65. The 
quantizer used is Q u a with parameter values v = .3, and a = 2. 
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Figure 1.4: The rectangle R = A*B 2 CfD l (bold outline) is a positively invariant 
set for the map T£. In fact, T£(R) is contained in the smaller rectangle ABCD 
(dashed outline), illustrating that the revised GRE scheme (1.16) is robust with 
respect to small additive errors. Here, $1 and $2 are the eigenvectors of the matrix 
in A in (1.28) below. In this figure, A2 = .6 and /1 is taken to be .0625. The length 
parameters l,r,h, and d will be discussed later. 
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in (1.25). As is done in [35], we first observe that the recursion formula (1.16) cor- 
responding to the leaky GRE is equivalent to the following piecewise affine discrete 
dynamical system on IR 2 : 



u n +i 


_ rpv 


Un 






a 




U n +2 




U n +l 



(1.24) 



where 





U 




1 




u 







rpv . 
1 a ■ 


V 




A1A2 Ai 




V 


- Q&(u,v) 


1 



where a = and v = We will construct a class of subsets R = R(n) = 

i?(Ai, A 2 , n) of E 2 for which T»(R) + B M (0) C R, where B^O) is the disk of radius fi 
centered at the origin; i.e. -B M (0) = {(u,v) : u 2 + v 2 < fi 2 }. Note that these sets are 
not only positively invariant sets of the map T£, but have the additional property 
that if (uo,Vo) G the image (u n+ i,v n+ i) = T^(u n ,v n ) may be perturbed at 

any time within a radius of ji (for example, by additive noise), and the the resulting 
sequence (w n )n=o will still remain bounded within i?(/x) for all time n. 

We refer the reader to Figure 1.4 as we detail the construction of the sets R(fi). 
Rectangles A\B\C\D\ and A2B2C2D2 in Figure 1.4 are designed so that their re- 
spective images under the affine maps T\ and T 2 (defined below) are both equal to 
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the dashed rectangle ABCD. 
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->■ 
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A1A2 


Ai 




V 




1 



;i.26) 



That is, rectangles A 1 B 1 CiD 1 and A 2 B 2 C 2 D 2 will satsify T X {A X B X C X D^) = ABCD = 
T 2 {A 2 B 2 C 2 D 2 ). More specifically, T X (A X ) = T 2 (A 2 ) = A, T X (B X ) = T 2 (B 2 ) = B, 
and so on. Since the rectangle R = Af B 2 C 2 D\ is contained within the union of 
A\BxC\D\ and A 2 B 2 C 2 D 2 , R is a positively invariant set for any map T(u,v) satis- 
fying 

Ti(u, v), (u, v)eR\ A 2 B 2 C 2 D 2 

T(u,v) = \ T 2 (u,v), (u,v)eR\A 1 B 1 C 1 D 1 (1.27) 

Ti(u,v) or T 2 (u,v), (u,v) G if 
where if = A^B^O^Dx fl A 2 B 2 C 2 D 2 . In particular, if the parameters a and z/ in the 
map are chosen such that the intersection of i? and the strip F — {(u, v) : — i> < 
■u + cif < z>} is a subset of if (recall a = a/\ 2 and z> = i//(AiA 2 ), then is of the 
form (1.27). Indeed, F is the region of the plane in which the quantizer Q~(w,f) 
operates in flaky mode. This geometric setup is clarified with a figure, provided in 
Figure 1.2.7. 



It remains to verify the existence of at least one solution to the setup in Figure 1.4 
for each (Ai, A 2 ) G [.9, l] 2 . This can be done, following the lead of [35], in terms of 
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Figure 1.5: The flaky quantizer Q^(u,v) outputs either —1 or +1 when the input 
(u,v) belongs to the strip F — {(u,v) : u < u + av < u} (shaded above). Here 
fi = .0625, a = 2, and v = .15. For these parameter values, the intersection of 
F and R is a subset of AiBiC\Di R A2B2C2D2, meaning that the scheme (1.16) 
produces bounded sequences iu n ), when implemented with this quantizer, and with 
input (u , Mi) G R. 
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the parameters defined in Figure 1.4. 



Note that the matrix 



1 

AiA 2 Ai 



(1.28) 



in (1.26) has as eigenvalues ei = x, - + \ /x i +4 X } x ^ anc j _ e2 _ — ( x ^-+\ /x i^ iXiX2 y j n 
particular, when Ai = A2 = 1, we have that e\ = ~ 1.618 and £2 = ~ .618. 
The eigenvalues ei and — e 2 have respective normalized eigenvectors 



$1 = sr 1 



1 

ei 



and $2 



1 

£2 



where s\ = a/1 + e\ and s 2 = \A + e 2- It follows that the affine map Ti acts as 
an expansion by a factor of e x along $! and a reflection followed by contraction by 
a factor of £2 along $ 2 , followed by a vertical translation of +1. T 2 is the same as 
Ti except with a vertical translation of -1 instead of +1. After some straightfor- 
ward algebraic calculations, the mapping relations described above imply that the 
parameters in Figure 1.4 are given by 



h = 

d = 

I = 

r = 



2(JL 



2s, 



1-e! ei(ei - l)(ei + e 2 ) 

H , si(2 ~ £i) 

1-ei ei(ei - l)(ei + £2) 

/i s 2 (2 - e 2 ) 



1-62 e 2 (l - e 2 )(ei + e 2 ) 



+ 



S 2 



1-62 (1 - e 2 )(ei + e 2 ) 



(1.29) 
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The existence of the overlapping region H = A 1 B 1 C 1 D 1 fl A2B2C2D2 is equivalent to 
the condition d > which in turn is equivalent to the condition jj, < ^t^^\ ■ This 
expression is minimized over the range (Ai, A2) G V = [.9, l] 2 when Ai = A2 = 1, in 
which case this constraint simplifies to /j, < ^p^=[ ~ -2008. 

We are interested primarily in quantizing numbers in [—1, 1], so that this is the range 
of interest for the input U\ = x; for uq we simply take 0. The set {0} x [—1, 1] C 
D(Ai A 2 )ev -^(^i) A2, A*) is equivalent to the condition that the line v = mu + b which 
passes through the points B and C in Figure 1.4 have y-intercept b > 1 for all 
(Ai, A 2 ) G V. This line has slope m = — £2, and passes through the point C, so that 
its y-intercept is given by b = s^ 1 (e 1 + e 2 )(h — d). Rearranging terms, this implies 
that b > 1 if and only if 

< (yTTgK^). (L30) 

e 1 + e 2 



The right hand side of the above inequality is bounded below over the range (Ai, A 2 ) G 
V by V 1 +(- 9 J 2 ( 2 -^) w 301; so that indeed | } x c _R(A X , A 2 ,/i) is satisfied 

independent of (Ai, A 2 ) G V, for all admissable /x G [0, .2008]. 



In practice, the "flaky" parameter v of the quantizer Q u a is not known exactly; we 
will however be given a tolerance 5 for which it is known that \u\ < 5. It follows that 
for each 5, we would like a range {o. m i n , a max ) for the "amplifier" a such that the 
GRE scheme (1.16), implemented with quantizer Q v a , produces bounded sequences 
(u n ) for all values of (Ai, A 2 ) G V. Now, it will be easier to derive all of the following 
results for a and z>, and return to the original parameters (a, v) afterwards. For a 
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particular choice of (Ai,A 2 ), it is not hard to derive an admissable range for a in 
terms of the eigenvalues e\ and e 2 for fixed 5. We will take \x = in the following 
analysis for the sake of simplicity. First of all, the tolerance 8 must be admissable, 
i.e. the coordinate (5,0) should lie within the region H. Indeed, for v G [0, 6], a 
can vary within a small neighborhood of ^, as long as the line with slope — i which 
passes through (0,6) remains bounded above in the region H by the line passing 
through Bi and C\. In other words, the admissable range for a is obtained from 
the constraint mi < -i < m 2 , where mi is the slope of the line through the points 
{(5, 0), Ci} and m 2 is the slope of the line through the points {(6, 0), Bf} in Figure 
1.2.7. Rewriting C\ and Bf in terms of the eigenvalues e± and e 2 via the relations 
(1.29), we have that 

L(e 1 ,e 2 ) < a < U{e 1 ,e 1 + e 2 ) (1.31) 

where 

L(x,y) = N(x,y)/D{x,y) 

x(x - 1) - (2 - x)(l - y) + Sxjx - l)(x + y)(l - y) 
x((2-x)(l-y)+y(x-l)) 

and 

u(x, y) = z + -y-zy-^-m-y + *)_ (L32) 

It is not hard to show that for any fixed y G [.90 + .90- 1 , <p + <p' 1 } w [2.0124, 2.236], 
the function U(x,y) increases as a function of x over the range x G [.90,0] ~ 
[1.456,1.618], as long as v < .75. It follows that the minimum of U(x,y) over 
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the rectangle S = [1.456, 1.618] x [2.0124, 2.236] occurs along the edge correspond- 
ing to x = 1.456. But £7(1.456, y) decreases as a function of y over the interval 
[2.0124,2.236], so that the minimum of U(x,y) over the entire rectangle S occurs 
at (x,y) = (1.456,2.236), giving the following uniform lower bound on a max for 
(Ai,A 2 )ey : 

a max > £7(1.456,2.236) 

« 2.281 - .9525. (1.33) 

We proceed in the same fashion to derive a uniform upper bound on a m in over 
(Ai,A 2 ) G V, except that we analyze the numerator and the denominator in the 
expression for L(x,y) separately. The numerator N(x,y) increases as a function of 
x over the range x G [1.456, 1.618] for fixed y G [.90 _1 , « [.5562, .618], so that 
N(x,y) achieves its maximum over the rectangle T = [1.456,1.618] x [.5562, .618] 
along the edge corresponding to x — 1.618. N (1.618, y) increases as a function of 
y over the range y G [.5562, .618], so that N(x,y) achieves its maximum over T 
at (x,y) = (1.618, .618). A similar analysis shows that the denominator D(x,y) is 
minimized over T at (x,y) = (1.456, .618). It follows that for (Ai, A 2 ) G [.9, l] 2 , a min 
is bounded above by 

a min < JV(1.618,.618)/D(1.456,.618) 

w 1.198(1 + 5) (1.34) 

Finally, we note that 5 is admissable for all (Ai,A 2 ) if and only if L(ei,e 2 ) < 
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U(ei,e 1 + e 2 ) for this value of 5 and for all coresponding e\ and e 2 . Thus an ad- 
missable range for 5 is obtained by equating the lower bound of 2.281 — .9525 in 
(1.33) and the upper bound of 1.198(1 + 5) in (1.34); namely, 5 G [0, .5037]. 



In terms of the original parameters a = A 2 a and v = AiA 2 ^), 

1.198(1 + 5) < a < 2.053 - .85685, and so 
< 5 < .4161, 

v = AiA 2 z> < AiA 2 5 < .337. (1.35) 
represents a uniformly admissible range for (a, v) over the interval (Ai, A 2 ) G [.9, l] 2 . 
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Chapter 2 

Quiet quantization in analog to 
digital conversion 

In the last chapter, we studied robustness properties of the /3-encoder, a quantization 
method in analog to digital conversion that is both robust with respect to quantizer 
imperfections like SA schemes, and also achieves optimal reconstruction guarantees 
like pulse code modulation (PCM). While we were able to show that /3-encoders 
are also robust with respect to other component imperfections, the /3-encoder is not 
robust with respect to additive noise; that is, if the discrete input x n to the /3-encoder 
(1.7) represent noisy signal samples x n = f n + e n , then unavoidably 

N-l 

J2bjr j = fn + e n + 0((3- N ), (2.1) 

3=0 
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and the situation is similar for the golden ratio encoder. That is, no matter how many 
bits are spent per Nyquist interval, the reconstruction error of the beta-encoder and 
golden ratio encoder is limited by the level of noise on the samples f n . 

On the other hand, EA schemes are robust with respect to additive noise. That 
is, the 0(\~ K ) reconstruction accuracy of the Kt\i order EA scheme is preserved 
under additive noise, as facilitated by the incorporation of all previous samples f£, 
k < n, in determining the quantization output b n . However, EA schemes suffer from 
an entirely different kind of instability in the context of audio signal processing, cor- 
responding to a mechanical rather than mathematical reconstruction inaccuracy: at 
the onset of periodicities in the bit output b n , the filters used in the reconstruction 
of the analog signal by EA quantizers are such that periodic oscillatory patterns 
in the quantization output b n generate low-level idle tones that are not present in 
the signal. These tones are particularly intolerable to the listener when no signal is 
present, such as between successive audio tracks. As idle tone components are most 
prominent in low-order, 1-bit EA schemes [5], it is desirable to adapt these schemes 
in such a way as to eliminate the possibility of periodic behavior of the output b n for 
vanishing input x n ; we shall refer to such as quiet EA quantizers, in line with the 
following definition: 
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Definition 2.0.12 (Quiet quantization). A quantization scheme is said to be 
quiet if the onset of identically zero input x n = after some finite time k induces 
constant bit output b n = b m after some later time m > k. 

In the first order SA model (2.2), which, starting from initial state u , follows the 
recursion 



u r . 



QK-l + fn) 
U n -1 + fn ~ K, 



(2.2) 



one can eliminate periodicities in the output b n at instantiation of small input f n by 
simply replacing the standard 2-level quantizer Q(u) = sgn(w) by a tri-level quantizer 

Qtri- 

1 U > T, 

Qlriiu) = \ -r<u<r, (2.3) 

— 1 U < —T. 

It is straightforward that the first-order SA scheme retains its first-order approxima- 
tion accuracy when implemented with the tri-level quantizer Qf ri , and is also quiet, 
in the sense of Definition 2.0.12; however, first order schemes are rarely used in prac- 
tice, and we would thus like a similar result for higher order schemes. Unfortunately, 



59 



the second-order EA scheme, 



n 



Q(F(u n -i, v n -i)) 



u 



'n 



U n -1 + fn ~ b 



n 



V n -1 + U, 



(2.4) 



is not quiet, as shown in [79], even when implemented with tri-level quantizer Q[ ri and 
with linear rule F 7 (u, v) = u + ^v. In fact, for most initial states (u , v ) at the onset 
of zero input = 0, the sequence b n will become oscillatory. An exception to this 
rule occurs for the initial condition (uq,Vq) = (0,0), in which case (u n ,v n ) = (0,0) 
and b n = persist as long as = remains zero. That is, the origin is a fixed point 
of the zero-input discrete map M : (u n ,v n ) — > (u n +i,v n+ i) produced by (2.4) with 
tri-level quantizer and zero input f£ = 0, but is not an attractive fixed point. Recall 
that an attractive fixed point for a discrete map M is a fixed point xq = M(x ) 
having the property that for any value of x in the domain that is close enough to x , 
the orbit of x, 



converges to x . Such a fixed point is said to be a global attracting fixed point if 
M n x — > xq for any value of x in the domain. In order to modify the second-order 
scheme (2.4) so that the origin is an attractive fixed point for the corresponding 
zero-input map, the state sequence (u n ,v n ) must be made to decay. One way to 
induce decay is by applying a contraction (u, v) — > (pu, pv) before each iteration of 



(x,M(x),M(M(x)) =M 2 x,...) 
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(2.4), resulting in the modification 



b n = Q(F(pUn- U pv n -i)) 
u n pu n _i -\- x n b nj 

v n = pv n -x + u n . (2.5) 

This so-called finite-memory scheme was first studied in [79] . As the accuracy of S A 
methods rely on 'keeping track' of the previous input f£, k < n through the sequence 
(u n ,v n ), it is not clear that the finite-memory scheme, which loses a fraction of its 
'memory' at each step, should still be second-order. However, as long as 1 — p 
is sufficiently small, second-order accuracy is maintained, as proven in [79]. For 
completeness, we include this result below, starting with the analogous result for the 
first-order finite-memory scheme, 

b n = Q(pu n ~i + fn), 

U n = PU n -l + fn~ K- (2.6) 

We note that the following result does not depend on the choice of quantizer, and 
requires only that the sequence (u n ) be uniformly bounded. 

Lemma 2.0.13 (Yilmaz). Suppose A > 0, let f e L 2 (IR) be bandlimited with supp 
f C [— n, 7r] and \\f\\oo < 1, an d let g be a function satisfying g — 1 for |£| < n, 
g = for |£| > \n, and g G C°° . Suppose that the leakage factor p > p x = 1 — j, 
and assume that the sequence (u n ) generated by the finite-memory scheme (2.6) is 
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bounded. If (b*) is the output of the first- order finite EA- quantizer (2.6), then 



ii/w -m\\oo<^(\w\w+c 9 ), (2.7) 



where C g < \\g\\ L , + \\\g'\\^, and f(t) = f). 



Proof. We have u n — pu n -\ = — fe^. Therefore, 



f(t)-f(t) = l^J2(Un-U n ^)g(t-j) + (l-p)J2^n-l9(t-j)) 



Using the bound 1 — p < t, 



I/W-/WI < ■ u ^-(J2^ t -j)-3it-— )\ + -J2\g(t- 



□ 



The analogous result for the second-order scheme follows the same argument, and 
we state the result without proof. 

Lemma 2.0.14 (Yilmaz). Let f,g, and p be as in Lemma 2.0.13. Assume that the 
sequence (u n ,v n ) generated by (2.5) is bounded. If (b*) is the output of the second- 
order leaky EA- quantizer given in (2.5), then 



\f(t) - f(t)\ < ^Nl/ll^ + + 2C 9 ), (2.8) 
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where C g and f are as before. 

It is important to note that Lemma 2.0.14 is useful only in the finite regime, even 
though it is asymptotic statement. Indeed, the oversampling ratio A is not taken 
to be arbitrarily large in practice as limited by the resolution of analog-to-digital 
technologies; in most A/D converters, A < 64 is set. For this range of A, the assump- 
tion in Lemma 2.0.14 that the damping factor p A can be prescribed up to resolution 
p G [.97, 1] is certainly reasonable. 

Lemma 2.0.14 rests on the assumption that the initial input (uo,vo) can be pre- 
scribed so that the sequence (u n , v n ) remains uniformly bounded. We can en- 
sure such stability by constructing positively invariant sets for the discrete map 
(u n ,v n ) — > (u n+ i,v n+ i) corresponding to the finite-memory scheme (2.5). Recall 
that a set S is positively invariant for the discrete map M if x G S implies that 
M(x) G S. The following theorem, borrowed again from [79], gives explicit construc- 
tions of such positively invariant sets for the finite-memory map (2.5) with linear 
rule F 7 (u, v ) = u + •yv: 

Theorem 2.0.15 (Yilmaz). Fix a < 1, and constant C satisfying 

c> 1 12 + 9(1 + a) 

~ 2 - 2a 2 8(1 - a) 
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Consider the following functions, 



,2 



B^u) = { Z(1 - a > z (2.9) 



B 2 (u) = < 2(1 + q) + ! C ' M "° (2.10) 
\ 2(&) + 1 - ^ «<0 

and £/ie subset of points in M. 2 lying between the graphs of B\ and B 2 : 

S = {(u,v) :v <B 1 (u),v>B 2 (u)}. (2.11) 

Finally, suppose that 7 in the linear rule F^(u,v) = u + 7-1; is set to a value in the 
range, 

2(2C(1 - « 2 )) 1/2 - (1 + a)} < < 4(1 + a)((l + a) + 2) 



(2C(1 - c^)) 1 / 2 + 2aC " 8C(1 + a) - (1 + a) - 4' 
If (u n+ i, v n+ i) is obtained from (u n ,v n ) G 5 wa £/ie second- order recursion (2.5) 
mpu£ / n satisfying \ f n \ < a, quantizer Q T tri , r < 1/2, and linear rule F^(u n ,v n ), then 
(u n+ i,v n+ i) e S. 

A qualitative depiction of the invariant region S is shown in Figure 2.1 below, as 
generated by admissible parameter values a — .9, C — 40, and 7 = .1. 



2.0.8 The revised scheme: Quiet SA quantization 

Henceforth, we will assume that the oversampling factor A is fixed, and that the 
damping factor p < 1 in the finite-memory scheme (2.5) satisfies the requirements of 
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Figure 2.1: The functions B\ and B 2 from Theorem 2.0.15 are pictured along with the 
lines Li and L 2 consisting of the points (u,v) such that F y (u,v) = .5 and F^(u,v) = 
— .5, respectively. This figure was generated using admissible values (a, C, 7) = 
(.9,40,. 1) 
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Lemma 2.0.14 for the finite-memory scheme to be second-order. We further assume 
that the ioo bound |/ n | < a < 1 is fixed, and that 7 is in the admissible range, as 
specified by Theorem 2.0.15, for the finite-memory SA scheme (2.5) to be stable. 

Unfortunately, the finite-memory scheme (2.5) implemented with tri-level quantizer 
Qf ri is not quiet for p sufficiently close to 1, as a range of initial conditions (u ,v ) 
at the onset of zero input still lead to oscillatory behavior of the output. In Figure 
2.2, we indicate whether or not the iterates (u n ,v n ) generated by the finite-memory 
scheme with zero input x n = remain oscillatory, for a set of perturbations p G [.96, 1] 
and initial conditions (uo,0) G [—2,0] x 0. 

Nevertheless, we show in the following sections that a simple modification to the 
finite-memory scheme (2.5) will ensure quietness, in the sense of Definition 2.0.12. 
The key idea is to apply a damping factor p as in the finite second order scheme 
(2.5), but not at every iteration; only at iterations n for which u n + ^v n > (or the 
symmetric, when u n + ^v n < 0). The situation u n + ^v n > must keep reccurring if 
the system is to remain stable, and the application of an asymmetric damping forces 
the state variables u n , and in turn the variables v n , to zero once /„ = 0. 

More precisely, we will consider the following asymmetrically damped modification 
to (2.5), which, starting from initial (wo,fo), can be implemented according to the 
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n 99l 



P O.E 



0.9751 



0.971 



0.961 



-1.6 -1.4 -1.2 



-0.6 -0.4 -0.2 



Figure 2.2: We indicate whether or not the iterates (u n ,v n ) converge to zero, as 
generated by the second order finite-memory scheme (2.5) with zero input x n = 0, for 
several values of damping factor p G [.96, 1] and initial conditions (uq, 0) G [—2, 0] x 0. 
In particular, (u n ,,v n ) — > can only occur over the black region, as a gray mark 
indicates that after one million iterations n > 1000000, the quantization output q n 
is nonzero \q n \ = 1 at regular time interval not exceeding 100 iterations, for one 
million additional steps n G [1000000,2000000]. The finite-memory scheme (2.5) 



was implemented with parameter p = .2 and tri-level quantizer Q 



1/3 
tri 
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recursion 



(K, Qn) 



V n = 



QKUn-l/lf + Vn-i), 

U n -1 + q n (P - l)ttn-l - K + ft, 

v n -i + q n (p - l)v n -i + u n , 



(2.12) 



with 4-level quantizer, 



Ql(u) 



;-i,o), «<-i/2, 

(0,0), -l/2<u<0, 

(0,1), 0<u<l/(2r), 

(1,1), «>l/(2r). 



(2.13) 



Although we will take r = p for the theoretical analysis of the next section, simula- 
tion results suggest that quietness persists for a much more general class of 4-level 
quantizers that includes the symmetric quantizer Q\. 



That the asymmetric scheme (2.12) maintains second order accuracy for p > p x = 
1 — t is an immediate consequence of Proposition 2.0.14. Furthermore, the asym- 
metric scheme is stable and inherits the positively invariant sets S 7 constructed in 
Proposition 2.0.15. It remains to prove that (2.12) is quiet. Our accomplishments 
are summarized in the following theorem. 

Theorem 2.0.16 (Main theorem). The asymmetrically-damped second- order E A 
scheme (2.12) is quiet when implemented with 4-level quantizer Q P A . That is, if the 
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input (/„) to (2.12) becomes identically equal to zero after some time n > N, then 
the quantization output (h n , q n ) = (0, 1) becomes constant after a finite number of 
additional iterations n > N + k. 

We prove Theorem 3.5.1 by showing that (0, 0) is a global attracting fixed point of the 
zero-input map governing (u n ,v n ) — > (u n+ i,v n+ i) in the asymmetric scheme (2.12). 
As such, we need to develop a better understanding for this dynamical system. 

Observation 2.0.17. The discrete map (u n ,v n ) — > (u n+ i,v n+ i) produced by the 
asymmetrically- damped SA scheme (2.12) with zero input = and J^-level quan- 
tizer Q P A corresponds to the orbit of a piecewise-affine dynamical system, 

(u n ,v n ) = M( 7iP ) («„_!, v n -i) = ^oD^fvi.Vi), (2.14) 



where 



D {l!f)) (u,v) 



(pu,pv), w/7 + v > 0) 
(u,v), else, 



(2.15) 



and 



Aj(u, v) = < 



(u-l,u + v-l), u/'j + v>l/2, 
(u,u + v), \u/^ + v\<l/2, 

(u + l,u + v + l), u/^ + v<-l/2. 



The origin (0, 0) is clearly a fixed point of the map M( 7)P ). We will show that (0, 0) 
is a global attracting fixed point, or that (u n ,v n ) = M," Ju 0: Vo) — > (0,0) for any 
initial condition (u ,v ), in two steps: 
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1. In Section 2.0.9, we construct a positively invariant set T for the map M( 7)P ) 
having the property that all orbits initialized in T converge to the origin. 

2. In Section 2.0.10, we show that T is also a trapping set for M( 7)P ), so that all 
orbits eventually become trapped in T. 

It is not clear that the proof need necessarily be split into two steps; however, nu- 
merical results such as those in Figure 2.5 suggest a marked change in the behavior 
of sequences (x, M( 7;P )X, M^^x, ...) upon entering the invariant set T. 

2.0.9 An invariant set T and convergence to the origin 

We begin by constructing a positively invariant set for the map M( 7jP ). We refer the 
reader to Figure 2.3 for reference. 

Proposition 2.0.18. The union T = T + U T~ of the affine regions 

T+ = {(u,v): < u < I, - 1/2 < uh + v < (1/2 + l/ 7 )}, 
T~ = {(w,u): - 1 < u < 0, - (1/2 + I/7) < m/7 + ^ < V 2 } 

is a positively invariant set for the map M( 7j(t) ). Moreover, the bit sequence b n as 
in (2.12) associated to a sequence Juq,Vq) = (u n ,v n ) contained in T has the 
following alternating structure: 

• Ifb n = 1 then b n+1 e {-1,0} ; 

• Ifb n = -1 then b n+1 e {0, 1}. 
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(a) The positively invariant set T 

Figure 2.3: The positively invariant set T = T + U T~ for the map M( 7jP ) with 
parameter 7 = .2. 

Finally, b n — 1 if and only if (u n ,v n ) G T + and (« n+ i, t>„+i) G T~ ; while b n = — 1 i/ 
and only if (u n ,v n ) G T~ and (u n+1 ,v n+1 ) G T+. 

Proof. First, as T = T + n T~ is the union of two convex sets, each containing the 
origin, x G T implies Ax G T for A G [0, 1]. As such, if T is positively invariant for the 
map Ay, then T must also be positively invariant for the full map M( 7j(0 ) = A^oD^^y 
In order to show that T is positively invariant for Ay, or that (u,v) G T implies 
Ay(w,t>) G T, we can assume without loss that that (u,v) G T + by symmetry of 
T and Ay. Note that the bit b in (2.12) associated to a point (u, v) G T + in the 
asymmetric scheme (2.12) is restricted to b G {0, 1}. We split the proof into two 
cases, according to whether b = or b = 1. 

I. Case 1: 6 = 1, or 7/2 < u + < 7/2 + 1: In this case, (u 1 , v') = Ay(u, v) = 
(u - l,v + u - 1). Since -1 < v! < 0, (u',v') G T if ««') G T~, or if 
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(u 1 , v') G T~ if —(7/2 + 1) < u' + 7?/ < 7/2. We can expand u' + 7-1/ as 

u' + jv ' = u + jv — (7 + 1) + 7«, 

and so indeed 

u' + 7 u' < (7/2 + 1) - (7 + 1) + 7 < 7/2, 

while also 

u' + 7«' > 7/2 - (7 + 1) + 7« > -(7/2 + 1). 
Thus, (u',v') G T~ and has associated bit 6' G { — 1,0}. 

2. Case 2: 6 = 0, or —7/2 < u + 71; < 7/2: Within this region, (u',v') = 
Ay(u,v) = (u,v + u). Since < v! < 1, («',«') G T if («',«') G T + , or if 
—7/2 < u' + 7f ' < 7/2 + 1. Proceeding as before, 

u + jv' = u + jv + ju < 7/2 + 7 < 7/2 + 1, 

while at the same time 

u + ryv' > -7/2 + 7-u > -7/2. 

Thus, (u',v') G T + and has associated bit b' G {0, 1}. 

The proposition follows by incorporating the symmetric result for initial conditions 
(u,v)eT~. □ 
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The alternating pattern of the bit sequence b n associated to iterates (u n , v n ) G T will 
be key in the following lemma. In particular, this constraint on the bit sequence, in 
combination with the imbalance induced by damping (u n+ i,v n+ i) = (pu n ,pv n ) only 
in the positive half-space S + = {(u, v) : u + jv > 0}, forces the u n , and in turn the 
v n , to zero. 

Lemma 2.0.19. Suppose that (u , v ) G T. Then the subsequence (u nj ) from (u n , v n ) = 
MP p \(uo, vq) consisting of indices rij for which u nj G T + satisfies u nj — > as n — > oo . 

Proof First note that the event (u n ,v n ) G T + must keep recurring, for if not, or if 
(u n ,v n ) T + for all n > 0, then b n = for all n according to Proposition 2.0.18, 
and 

n 

V n = V + ^ U 3 = v + (n+ 1)U 
3=0 

diverges. The index set X + = {nj : u nj G T + } is then an infinite subset of the natural 
numbers, so that (u nj ) represents an infinite subsequence of (w„). As u n+ i = pu n — 1 
in passing from T + to T~, u n+ i = u n while both (u n ,v n ) and (-u n+ i,f n+ i) remain in 
T _ , and u n+ i = u n + 1 in passing back from to T + , it follows that u nj+1 = pu nj 
contracts along the index set X + . The subsequence u nj = p j u no thus converges to 
zero as nj — > oo. □ 

Note that convergence of the subsequence {u nj ) in Lemma 2.0.19 does not guarantee 
convergence of the full sequence («„), as the residual sequence (w n J over the index 
set Z~ = N \ X + forms a subsequence satisfying u Hk — > — 1 as k — > oo, if X - is an 
infinite set. However, the following proposition ensures that this situation cannot 
occur. 
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Proposition 2.0.20. If(u Q ,v ) G T, then the full state sequence (u n ,v n ) = ^(uq,v ) 
satisfies (u n ,v n ) — > (0 + ,0 + ) as n — » oo. 

Proof. Using the aforementioned results, we assume without loss that (tio,^o) G T + , 
and that u < e for arbitrarily small e > 0. 

I. Consider the situation where (u n ,v n ) G T + for all n > 0. In this case, b n = 
for all n > 0, u n — > + , and v n is defined recursively according to 

v n+l = pv n + pw n _i 

= P^n-l + p 2 U n ~2 + pU n -l 



n— 1 



i=o 

= p n v + p n e, (2.16) 



where we use in the last equality that Uj = p% . In this case, then, v n — > 0. 

2. Suppose that p(w + 7^o) < 7/2, in which case bi — 0, ui = pu , and t>i = 
p(vo + u ), yielding 

p(ui + ) = p 2 (w + 7^o) + P 2 7 m o 

< p( 7 /2)+p 2 7 e, 

< 7/2, (2.17) 
the last inequality requiring that uq < e < (1 — p)/(2p 2 ), which is satisfied as 
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e was arbitrary. This case, then, reduces to the first case. 

3. It remains only to consider sequences (u n ,v n ) for which (u n ,v n ) G T + implies 
b n = 1, or p{u n + 7i> n ) > 7/2. In terms of Figure 2.3, such sequences avoid the 
lower parallel section of the set T + . The trajectory of such (u n) v n ) under the 
map M( 7 p ) can then be described as follows: 

(a) If (u n ,v n ) G T+, then 

b n = 1, 

u n+1 = pu n - 1, 

v n+1 = pv n + pu n - 1 < pv n + pe - 1 

Else if (M n ,w n ) G T~, 

(b) If -1/2 < m„/ 7 + v n < 1/2, then 

&„ = 0, 

u n+ i =u n <e-\, 

v n +i = v n + u n < v n + e - 1 

(c) If M n /7 + f n < —1/2, then 

b n = -1, 

M n+ i = m„ + 1 < e, 

^n+l = W„ + M„ + 1 < W„ + e 
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Since (c) cannot occur in successive iterations according to the alternating 
nature of b n , we arrive at the period-2 inequality 

v n+2 <v n + 2e-l, (2.18) 

indicating that the iterates v n diverge. This case, in conclusion, cannot occur. 

□ 

2.0.10 The invariant set T as a global trapping set for M( 7/9 ) 

We prove in this section that the positively invariant set T is also a global trapping 
set for M( 7)P ); this result in combination with the results of the last section guarantee 
that the origin is a global fixed point for M( 7 P ). 

To be precise, 

Definition 2.0.21. A global trapping set of a discrete mapping M : M. 2 — > M. 2 is any 

set S CR 2 such that 

1. M(S) C S (that is, S is a positively invariant set for M), 

2. for any x G 1R 2 ; there exists n > such that M n x G S. 

The construction of global trapping sets for the full second-order EA scheme was 
recently developed by Sidong [80]. We will follow the approach of that work and 
show that T is a global trapping set for the map M( 7 P ) using a Lyapunov function 
argument. In the context of discrete dynamical systems, a Lyapunov function refers 
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loosely to a nonnegative convex energy functional h : M. 2 — > R that contracts along 
the action of the discrete map, h(Mx) < (1 — 5)h(x). If h contracts as such for 
all x G IR 2 , and h(x) = only if x = 0, then all orbits (u n ,v n ) = M n (u ,vo) must 
converge to the global fixed point (0,0). More generally, if h decreases along orbits 
h(M n x.) < h(x) — 5 while the iterates M n x remain outside an invariant set S G M 2 , 
then S can be shown to be a globally attracting set, as follows: 

Lemma 2.0.22. Let S be a positively invariant set for the discrete map M . Suppose 
there exists a nonnegative function h : 1R 2 — > 1R + and a parameter 5 > for which, if 
x is not in S, then either M fc x G S or h(M k x) — h(x) < —5 after some finite time 
k. It follows that S is a global trapping set for M. 

Proof. Suppose x ^ S is such that M fc x ^ S for all k > 0, and denote c = h(x). 
From the stated hypotheses, h(M kl x.) < c — 5 after some finite time k 1: and, by 
induction, h(M kn x) < c — n8 after a finite time k n for any positive integer n. But 
then eventually /i(M fc x) < 0, which is impossible since h > by assumption. □ 

Lemma (2.0.22) has the following implication for the map M( 7 P ) = A 7 o -D( 7iP ) and 
invariant set T under consideration. 

Proposition 2.0.23. Consider the positively invariant set T as constructed in Propo- 
sition (2.0.18). T is a global trapping set for M( 7jjt) ) if there exists a nonnegative and 
convex function h : M. 2 — > IR + and a parameter e > for which the following sets are 
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contained in T: 



A h : {(u,v)eR 2 : 



h(u,v) < h(Aj(u,v))}, 



Q e h : {(u,v) G M 2 :h(u, v) < e} 



(2.19) 



Proof. We verify that the conditions for Lemma (2.0.22) hold for invariant set T and 
parameter 8 — e(l — p) > 0. The proof is split into two cases, according to whether 
or not x G S + = {(u, v) G M 2 : u + > 0} or S~ = M 2 \ S + . 

1. Case 1: Suppose first that x G S + . On this half-space, -D( 7 , p ) : x — > px acts 
as a contraction. If x ^ T, but £>( 7)P )X G T, then M( 7>p )X = A 7 o D^^x. G T 
by positive invariance of T for A 7 . If alternatively Z)( 7iP )X ^ T, then 



< /i(D (7jP) x) - /i(x), as A ft G T 
= /i(px) — /i(x) 

< p/i(x) — /i(x), using convexity of /i and h(0) = 

< -e(l -p), since G T 
= -5. 



2. Case 2: Suppose now that M ( fc 7 p) x G (S + ) c n T c for all fc > 0. Under this 
assumption, the quantization output is restricted to bj G { — 1,0} for all j, so 



h(M {jtf>) x) - /i(x) 
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that Uj = Uj-i — bj forms a monotonically nondecreasing sequence. If Uj < 
for all j, then \vj\ = \vo + Yjk=o u k\ diverges, which is impossible since (uj,Vj) 
belongs to a bounded set. If alternatively Uj > at some index j, then u n > 
for all n > j by monotonicity and, still, \vj\ — > oo. The case Uj = for all 
j > n cannot happen since (u n ,v n ) ^ T by assumption. It follows that this 
case is impossible. 

We can conclude that after a finite number of iterations k, either ^ e T or 
M[ 1P ) ^ S + . The result follows by application of Lemma 2.0.22. □ 

Following the approach in [80], we consider the following Lyapunov function for the 
map A 7 : 

h(u,v) = u 2 + \2v-u\. (2.20) 

Note that h is a convex function on M 2 . Letting h + (u, v) = u 2 + 2v — u and h~(u, v) = 
u 2 —2v+u, it is easily verified that h(u, v) = max {h + (u, v), h~(u, v)}. The motivation 
for this choice of Lyapunov function is that h + and Yr are the unique functional 
solutions satisfying the relations 



h + (A 1 (u,v))\ u+ro> ^ /2 = h + (u-l,u + v -1) = h + (u,v), 
h-(A 1 (u,v))\ u+jv< _ i/2 = h-(u + l,u + v + l) = h-(u,v). (2.21) 

It is straightforward that the positively invariant set T contains the set Q e h = {(u, v) : 
h(u, v) < e} for sufficiently small e. In order to apply Proposition (2.0.23), it remains 
only to verify that T also contains the set A h = {(u, u) G I 2 : h(u, v) < h^A^u, vj) }. 
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Proposition 2.0.24. Consider the map as defined in (2.0.17), the convex func- 
tion h(u, v ) = u 2 + \2v — u\, and the set R — RiU R 2 , with 

R 1 = {(u, v ) g M 2 : u + ^v > 0,2v + u < l,u < 1/2}, and 

R 2 = {( u ,v) EM 2 :u + -fv <0,2v + u>-l,u> -1/2}. (2.22) 

The set R is contained in the positively invariant set T (as shown in Figure 2.4)- 
Moreover, R contains the set Ah = {(u, v) E M? : h{u,v) < h[A^(u, v)) }. 

We defer the proof of Proposition 2.0.24, which amounts to a straightforward case 
by case analysis, to Appendix 1. 

The main result of this chapter, Theorem 3.5.1, follows from Proposition 2.0.23 
and Lemma 2.0.20. 

Proof of Theorem 3.5.1. By Proposition (2.0.23), T is a global trapping set for M( 7 P ), 
the discrete map governing (u n ,v n ) — > (u n+ i,v n+ i) in the asymmetrically-damped 
scheme (2.12) with 4-level quantizer Q P A and zero input /„ = 0. In other words, for 
any initial condition (u ,v ), the iterates (u n ,v n ) = M^™^(-u ,t> ) become trapped 
in the set T after a finite number of iterations. From Lemma 2.0.20, it follows that 
(u n ,v n ) — > (0 + ,0 + ); once the iterates (u n) v n ) E T + n {0 < u + < 7/2} become 
trapped in the positive quadrant, the bit output (6 n , q n ) = (0, 1) is constant. □ 
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-1.5 -1 -0.5 0.5 1 1.5 

u 

Figure 2.4: The positively invariant set T is the union of two parallelograms, as 
shown in dark gray for parameter 7 = .2. As the region Ah, on which the map A 1 
may satisfy Ay(h(u,v)) — A^iu^v) > 0, is contained in R (light gray triangles) which 
is in turn contained in T, the set T is a globally attracting set. 
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Figure 2.5: Different magnifications of an orbit of the map Mi^^ for p = .98 and 7 = .2. 
The initial point (uq,vq) = (—3.4, 12.7) can be seen in the top image; this point and the 
first few iterations are outside the trapping set T (dark gray parallelograms). Once trapped 
in T, the iterates (u n ,v n ) converge to the globally attracting fixed point (0,0). 
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2.1 Future Directions 



We have introduced a 2-bit second order SA scheme that is guaranteed to be quiet: 
periodicities in the bit output are eliminated when the input vanishes. It remains to 
analyze the stability of such quietness in the face of unavoidable component impre- 
cisions. Although a full analysis is beyond the scope of the current presentation, we 
outline a few key issues below. 

1. Randomness helps. Any introduction of randomness to a quantization 
scheme will only increase the aperiodicity of the bit output; this includes zero- 
mean additive noise on the samples x n = f n + e n and unbiased 'flakiness' in the 
quantizer element. Quietness for the asymmetrically-perturbed scheme (2.12) 
is threatened more from biased imprecisions, such as those resulting from an 
offset in the 4-level quantizer, Ql(u) = Ql(u + 5). Numerical simulations sug- 
gest that quietness of the asymmetric scheme nevertheless persists in the face 
of such offsets: the state sequence (u n) v n ) still converges at the onset of zero 
input, but the iterates v n approach a nonzero limit L(5) that grows with the 
magnitude of the shift; if 5 > is small enough, then quietness is still at- 
tained. More generally, quietness of the asymetrically-damped scheme appears 
to persist for a much more general class of quantizers than the the particular 
4-level quantizer Q P A that was needed for the theoretical analysis of the previous 
section; indeed, indistinguishable results are observed by implementing (2.12) 
with the far simpler symmetric 4-level quantizer, 
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Qi{u) 



(-1,0), 
(0,0), 



u < -1/2, 
-1/2 < u < 0, 
< m < 1/2, 



(2.23) 



< 



(0,1) 



(1,1), 



u > 1/2. 



2. Other robustness issues. 

(a) Imprecisions in p: The damping factor p does not have to be constant 
from iteration to iteration nor known precisely; all of the analysis of the 
previous section holds for p n > p x varying at each iteration, as long as 
the sequence remains bounded from below as to maintain second-order 
accuracy of the scheme (2.12), and is not chosen adversarialy to converge 
Pn -> 1. 

(b) Imprecisions in 7: We have until now assumed that the parameter 7 
is fixed throughout the iterations. However, numerical evidence suggests 
that quietness does not require that 7 be fixed in the expression u n + 7t> n , 
as long as 7 is within the range of stability, as stated in Theorem (2.0.15). 

(c) Leaky integrators. More realistically, one should incorporate integrator 
leakage into the proposed model (2.12), analogous to the modification 
studied for the golden ratio encoder in the last chapter. That is, the 
asymmetric scheme, being second order, requires two delay elements in 
its implementation to hold each of the states u n and v n over one iteration. 
After one clock time, the stored input in the first delay is reduced to 
Ai times its original value, while the stored input in the second delay is 
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replaced by A 2 times its original value; we can incorporate such leakage 
into the asymmetric model (2.12) as follows: 



(b n , q n ) 



Q 4 (Aim„_i/7 + \2Vn-1), 



u. 



Al (« n -l + q n (pi ~ l)u n -l) -K + /, 
A 2 (Un-l + <?n(P2 - l)«n-l) + «n- 



A 



"n 



n 



(2.24) 



Above, (Ai, A 2 ) G [.95, l] 2 in most circuits on interest; the specific leakage 
factors within this window are generally unknown and may vary slowly 
in time. When Ai = A 2 , the positively invariant set T is still a trapping 
set for the revised scheme; however, once in this set, it is not immediately 
clear how to modify the analysis of section 2.0.9 to prove that (u n ,v n ) — > 
(0,0). Numerical studies indicate that quietness persists in the face of 
such leakage, requiring only that the perturbations remain asymmetric, 
or that each of Aip n and A 2 p n is strictly smaller than either of Ai and A 2 . 

3. Hybrid Chaotic-Quiet £A Quantization 

While the asymmetry of the perturbation 1 — p n in (2.12) is key for inducing 
quietness, it is not clear that this perturbation need be nonnegative. So-called 
chaotic SA quantization [51] has been proposed as a means to eliminate idle 
tones in EA quantization, not just at the onset of zero input, but at the onset 
of any constant, or DC input sequence. In the context of 1-bit, second-order 
SA quantization, chaotic SA quantization corresponds to expanding, rather 
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than contracting, by a small factor (1 + e) > at each iteration: 

b n = Q(«n-l+7Un-l), 

u n = (1 + e)w n _! + /* - b n , 

v n = (1 + e)v n -i + u n . (2.25) 

This modification is termed chaotic because it has the effect of generating ape- 
riodic output at the onset of DC input; however, a complete stability analysis of 
the scheme (2.25) remains an open problem. Numerical studies, such as those in 
Figure 2.1, indicate that the chaotic scheme outperforms the asymmetrically- 
damped scheme (2.12) in generating aperiodic output for general DC input, 
although the asymmetric scheme shows a marked improvement over the fully 
damped tri-level scheme. While the chaotic scheme (2.25) is not quiet in the 
sense of Definition (2.0.12), we expect that quietness can be achieved by intro- 
ducing asymmetry into the model (2.25). Specifically, the hybrid scheme, 

(&n,5n) = Q4(Un-lh + Vn-l), 

u n = M n _i + q n eu n ^i -b n + 

v n = v n _i + q n ev n ^! + u n , (2.26) 

with q n taking values in {—1, 1} so that either damping (p n = 1— e) or expansion 
{p n = 1 + e) is applied at each iteration, appears to inherit the aperiodic orbit 
structure of its parent chaotic map (2.25), and also quietness induced by its 
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asymmetry. We hope to study the stability and aperiodicity for the chaotic 
maps (2.25), (2.26) in the future. 
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(a) Original and damped second order SA scheme (2.5) with p = 1 and p = .995, 
respectively 




(b) Asymmetrically damped SA scheme 
with p = .995. 




-1 -0.8 -0.6 -0.4 -0.2 0.2 0.4 0.6 0.8 1 



(c) Chaotic SA scheme with p = 1.01. 

Figure 2.6: One million iterates (u n ,v n ) from a single orbit under (a) the original 
damped second order scheme (2.5) with damping factor p = 1 and p = .995, re- 
spectively, (b) the asymmetrically-damped scheme (2.12), with asymmetric damping 
factor p = .995, and (c) the chaotic map (2.25), with expansion factor p = 1.01. 
Constant input f n = —.001 was used for^Sl cases, along with parameter 7 = .2 and 
either tri-level quantizer (a) or 4-level quantizer (6), (c). The asymmetrically-damped 
orbit (6) is seen to fill up more space in the plane than the fully damped orbits (a), 
but not as densely as the chaotic orbit (c), which appears to be aperiodic. 



2.2 Appendix : proof of Proposition 2.0.24 

Let S + = {(u, v) G M 2 : u + > 0} and let S~ = {(u, v) G R 2 : u + < 0} = E 2 \ 
S + . Suppose that (u,v) G S + \Ri is such that (u',v') = A 7 (u : v) = (u — l,u + v — 1). 
Our first aim is to show that h(Mi(u,v)) < h(u,v) in this situation. 

1. Case 1: If 2v — u > 0, then h(u, v) = u 2 + 2v — u, while h(u', v') = u 2 — 2u + 
1 + \u + 2v — l|,so 

h(u',v') < h(u,v) \u + 2v - 1| < u + 2w - 1 u + 2u > 1. 

But since G 5 + \ we know that w > 1/2, so u + 2v > 2u > 1, and 

h(u', v') < h(u, v) holds in this case. 

2. Case 2: If 2v — u < 0, then h(u,v) = u 2 + u — 2v, while the expression for 
h(u', v') remains unchanged, so 

h(u', v') < h(u, v) \u + 2v - 1| < 3u - 2v - 1. (2.27) 

We split this case into two subcases: 

(a) Case 2(a): If, on the other hand, u + 2v > 1, then |-u + 2t> — 1| = -u + 2t> — 1, 
and (2.27) simplifies to 

h(u', v') < h(u, v) u + 2v - 1 < 3m - 2v - 1 2v < u. 

But since 2v — u < by assumption, the result holds in this subcase. 
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(b) Case 2(b): If u + 2v < 1, then 



h(u', v') < h(u, v) -u-2v + l<3u-2v-l 

u > 1/2. (2.28) 

But of course, the condition u > 1/2 holds throughout S + \ R±. 

We have shown thus far that h(u',v') < h(u,v) if (u,v) G S + \ R\ and Mi(u,v) = 
(u — 1, u + v — 1). It remains to show that h(u,u + v) < h(u, v) if (w, i>) G S* + and 
F 7 («,t>) < 7/2. By inspection of Figure 1, the intersection of the regions S + \ R\ 
and {(u, v) : F^(u,v) < 7/2} consists of two disjoint sets: Pi : {(u, v) : < u + 'yv < 
7/2, u + 2v > 1, u < 0}, and P 2 : {(w, u) : < w + < 7/2, u + 2u < -1, u > 0}. 

1. Case 1: (u, v) G Pi: As P± C i>) : u < 0, i> > 0}, the restriction 2v — u > 
trivially holds, and so t> ) = u 2 + 2t> — u, and 

-u + v ) < t>) -u 2 + |2(w + v) — u\ <u 2 + 2v — u 

-<=>- \2v + u\ <2v — u 

-<=>- 2t> + u < 2v — u 

<^ u<0 

which is satisfied by assumption. 
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2. Case 2: (u,v) G P2 : P2 is contained in the quadrant {(w, t>) : w > 0,t> < 0}, 
and so 2v — u < 0, t>) = u 2 — 2t> + u, and 

h(u,u + v) < h(u,v) -<=>- |w + 2t>| < u — 2v -<=>- — (-u + 2t>) < u — 2v -u > 0, 

which again is satisfied by assumption. 

By symmetry of the set R and the map A y , we have also that /i(A 7 (m, v)) < h(u,v) 
if (u,v) e S- nR\R 2 . 



91 



Chapter 3 



Cross validation in compressed 
sensing 



3.1 Compressed sensing: Redefining traditional 
notions of sampling 

Compressed Sensing (CS) is a fast developing area in applied mathematics, motivated 
by the reality that most data we store and transmit contains far less information than 
its dimension suggests. For example, a one-dimensional slice through the pixels in 
a typical grayscale image will contain segments of smoothly varying intensity, with 
sharp changes between adjacent pixels appearing only at edges in the image. If a 
large data vector contains only k « N nonzero entries, or is k-sparse, it is common 
practice to temporarily store the entire vector, possibly with the intent to go back 
and replace this vector with a smaller dimensional vector encoding the location and 
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magnitude of its k significant coefficients. In compressed sensing, one instead collects 
fewer fixed linear measurements of the data to start with, sufficient in number to 
recover the location and numerical value of the k nonzero coordinates at a later time. 
Finding "good" linear measurements, as well as fast, accurate, and simple algorithms 
for recovering the original data from these measurements, are the twofold goals of 
Compressed Sensing research today. 

In the sequel, we restrict our focus to signals that can be represented as real-valued 
vectors x = (x 1 , x 2 , x^) G M. N . This signal model works well for digital images, 
which are naturally sparse: for example, a one-dimensional slice through the pix- 
els in a typical grayscale image will contain segments of smoothly varying intensity, 
with sharp changes between adjacent pixels appearing only at edges in the image. 
Often this sparsity in information translates into a sparse (or approximately sparse) 
representation of the data with respect to some standard basis; for the image ex- 
ample, the basis would be a wavelet of curvelet basis. For such N dimensional data 
vectors that are well approximated by a A;-sparse vector (or a vector that contains at 
most k « N nonzero entries), it is common practice to temporarily store the entire 
vector, possibly with the intent to go back and replace this vector with a smaller 
dimensional vector encoding the location and magnitude of its k significant coeffi- 
cients. In compressed sensing, one instead collects fewer fixed linear measurements 
of the data to start with, sufficient in number to recover the location and numerical 
value of the k nonzero coordinates at a later time. Finding 'good' linear measure- 
ments, as well as fast, accurate, and simple algorithms for recovering the original data 
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from these measurements, are the twofold goals of compressed sensing research today. 

Review of basic CS setup. The data of interest is taken to be a real-valued 
vector x e M. N that is unknown, but from which we are allowed up to m < N linear 
measurements, in the form of inner products of x with m vectors Vj G M. N of our 
choosing. Letting <£> denote the m x N matrix whose jth row is the vector Vj, this is 
equivalent to saying that we have the freedom to choose and store anmxiV matrix 
$, along with the m-dimensional measurement vector y = $x. Of course, since $ 
maps vectors in M. N to vectors in a smaller dimensional space M m , the matrix $ is 
not invertible, and we thus have no hope of being able to reconstruct an arbitrary 
N dimensional vector x from such measurements. 

However, if the otherwise unknown vector x is specified to be A;-sparse, and k is 
fairly small compared with N, then there do exist matrices $ for which y = &x 
uniquely determines x, and allows recovery of x using fast and simple algorithms. It 
was the interpretation of this phenomenon given by Candes and Tao [18], [17], and 
Donoho [40], that gave rise to compressed sensing. In particular, these authors define 
classes of matrices that possess this property. One particularly elegant characteri- 
zation of this class is via the Restricted Isometry Property (RIP) [17]. A matrix <3> 
with unit normed columns is said to be fc-RIP if all singular values of any k column 
submatrix of <3> lie in the interval [1 — 5, 1 + 5] for a given constant S < 1. For a fixed 
value of 5 > 0, an m x iV matrix $ whose entries $jj are independent realizations 
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of a Gaussian or Bernoulli random variable satisfies 2/c-RIP of level 



k = K(S,m,N) := c 1 (5)m/ \og(N/m)), where m < -N, 



(3.1) 



with probability > 1 — 2exp— c 2 (S)m [6]. Also with high probability, an m x N 
matrix obtained by selecting m rows at random from the N x N discrete Fourier 
matrix satisfies 2/c-RIP of the same order as (3.1) up to an additional log (log N) 
factor [66]. In fact, the order of k given by (3.1) is optimal given m and N, as shown 
in [26] using classical results on Gelfand widths of l± unit balls in To date, there 
exist no deterministic constructions of RIP matrices of this order. 

Recovering or approximating x. As shown in [?], the following approximation 
results hold for matrices <3> that satisfy 2/c-RIP with constant 5 2 k < 2(3 — v / 2)/7 ~ 



1. If x G M N is A;-sparse, then x can be reconstructed from $ and the measurement 
vector y = $x as the solution to the following t\ minimization problem: 



2. If x is not A;-sparse, the error between x and the approximation x = £i(§,y) 
is still bounded by 



.4531: 



x = £i($, y) := arg min ||z||i. 



&z=y 



(3.2) 



x-x\\ 2 < -=(7* (a;),*, 



(3.3) 



where c = 2(1 + 5 2k )/(l - S 2k ), and <J k (x)i» ■ 
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best possible approximation error in the metric of 1^ between x and the set 
of /c-sparse signals in WL N . The approximation error <Jk(x)iN is realized by the 
/c-sparse vector x k e M. N that corresponds to the vector x with all but the k 
largest entries set to zero, independent of the 1^ norm in the approximation 
cr k (x)iN. 

This immediately suggests to use the £i-minimizer Ci as a means to recover or ap- 
proximate an unknown x with sparsity constraint. Several other decoding algorithms 
are used as alternatives to l\ minimization for recovering a sparse vector x from its 
image y = <£>£, not because they offer better accuracy ( l\ minimization gives optimal 
approximation bounds when $ satisfies RIP), but because they can be faster and 
easier to implement. For a comprehensive survey on compressed sensing decoding 
algorithms, we refer the reader to [57]. 

3.2 Estimating the accuracy of compressed sens- 
ing estimates 

According to the bound (3.3), the quality of a compressed sensing estimate x = 
A(<E>,y) depends on how well x can be approximated by a /c-sparse vector, where 
the value of k is determined by the number of rows m composing $. While k is 
assumed to be known and fixed in the compressed sensing literature, no such bound 
is guaranteed for real-world signal models such as vectors x G WL N corresponding to 
wavelet coefficient sequences of discrete photograph-like images. Thus, the quality 
of a compressed sensing estimate x in general is not guaranteed. 
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If the error \\x — £|| 2 /||a;||2 incurred by a particular approximation x were observed 
to be large, then decoding could be repeated using more measurements, perhaps at 
increasing measurement levels {mi,m 2 , ...,m p }, until the error \\x — %||2/||^||2 corre- 
sponding to rrij measurements were observed to be sufficiently small. Of course, the 
errors \\x — Xj\\ 2 and \\x — Xj\\ 2 /\\x\\ 2 are typically not known, as x is unknown. Our 
main observation in this chapter is that one can apply the Johnson-Lindenstrauss 
lemma [52] to the set of p points, 

{(x - x ± ), (x - x 2 ), (x - x p )}. (3.4) 

In particular, r = O(logp) measurements of x, provided by y^ = \E% when \l> is, e.g. 
a Gaussian or Bernoulli random matrix, are sufficient to guarantee that with high 
probability, 

4/511?/* - ^Xj\\ 2 < \\x - xj\\ 2 < 4/3||?m> - Vxj\\2 (3.5) 

and 

lly^lh ~ \\x\\ 2 ~ lly^lh 

for any p compressed sensing estimates. The equivalences (3.5) and (3.6) allow the 
measurable quantities \\y^ — ^Xj\\ 2 and \\y^ — ^jlh/H^lh to function as proxies for 
the unknown quantities \\x — Xj\\ 2 and \\x — Xj\\ 2 /\\x\\ 2 ] these proxies can be used to 

(a) provide tight numerical upper and lower bounds on the errors \\x — Xj\\ 2 and 
\\x — % || || 2 at up to p compressed sensing estimates Xj, 
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(b) provide estimates of the underlying fc-term approximations \\x — x k \\ 2 of x for 
up to p different values of k, and 

(c) return from among a sequence of estimates (x±, ...,x p ) with different initializa- 
tion parameters, an estimate x cv = argmin^ — ^£j||2 having error x cv \\ 2 
that does not exceed a small multiplicative factor of the best possible error in 
the metric of £ 2 between x and an element from the sequence at hand. 

More precisely, all CS decoding algorithms require as input a parameter m corre- 
sponding to the number of rows in $; some compressed sensing decoding algorithms 
(such as greedy algorithms) require also a parameter k indicating the sparsity level 
of x, and other algorithms require as input a bound 7 on the expected amount of 
energy in x outside of its significant coefficients. All CS decoding algorithms can be 
symbolically represented by functions of the form A($,y, k,j), and we will give ex- 
amples where each of the parameters m, k, and 7 can be optimized over a sequence of 
estimates (xi,x 2 , ...,x p ) indexed by increasing hypotheses on each of the parameters 
m, k, and 7. 

The method we propose for parameter selection and error estimation in compressed 
sensing is reminiscent of cross validation, which is a technique used in statistics and 
learning theory whereby a data set is separated into a training/estimation set and a 
test/cross validation set, and the test set is used to prevent overfitting on the training 
set by estimating underlying noise parameters. Indeed, we take a set of m measure- 
ments of the unknown x, and use m — r of these measurements, $x, in a compressed 
sensing decoding algorithm to return a sequence (xi,x 2 , ...) of candidate approxima- 
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tions to x. The remaining r measurements, ^x, are then used to identify from among 
this sequence a 'best' approximation x = Xj, along with an estimate of the sparsity 
level of x. Although the application of cross validation in compressed sensing has 
been previously proposed by Boufounos, Duarte, and Baraniuk in [11], the context 
in which it is studied there is different from that of the present paper (we will discuss 
this difference further in the last section), and in their application one cannot im- 
mediately apply the mathematical justification of the Johnson Lindenstrauss lemma 
that we present below. 

3.3 Preliminary notation 

Throughout the paper, we will be dealing with large dimensional vectors that have 
few nonzero coefficients. We use the notation \x\ = n to indicate that a vector 
x G M. N has exactly n nonzero coordinates. 

We will sometimes use the notation a ~ e b as shorthand for the multiplicative relation 



that can be worded as "the quantity a approximates the quantity b to within a mul- 
tiplicative factor of (1 ± e)". Note that the relation ~ e is not symmetric. Properties 
of the relation a ~ e b are listed below; we omit the proofs, which amount to a string 
of simple inequalities. 




(3.7) 



Lemma 3.3.1. Fix e G (0, 1). 
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1. If a, b G R + satisfy a ~ e 6 ; £/ien 6/ [(1 + e)(l — e)] ~ e a- 

2. If a,b,c,d G IR + satisfy a ~ e 6 and c ~ e inen a/c ~a b/d for parameter 
6 = 2e/l-e. 

3. If (ai, a 2 , a p ) and (6i, 6 2 , 6 P ) are sequences in M + , and aj ~ e Oj /or eaca 
1 < j < P, then miiij a,- ~ e mim, bj . 

3.4 Mathematical foundations 

The Johnson Lindenstrauss (JL) lemma, in its original form, states that any set of p 
points in high dimensional Euclidean space can be embedded into e~ 2 log(p) dimen- 
sions, without distorting the distance between any two points by more than a factor 
of (1 ± e) [52]. In the same paper, it was shown that a random orthogonal projection 
would provide such an embedding with positive probability. Following several sim- 
plifications to the original proof [46], [49], [29], it is now understood that Gaussian 
random matrices, among other purely random matrix constructions, can substitute 
for the random projection in the original proof of Johnson and Lindenstrauss. Of the 
several versions of the lemma now appearing in the literature, the following variant 
presented in Matousek [55] is most applicable to the current presentation. 

Lemma 3.4.1 (Johnson-Lindenstrauss Lemma). Fix an accuracy parameter e G 
(0, 1/2], a confidence parameter 5 G (0, 1), and an integer r > r = Ce~ 2 log ^. 
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Let M. be a random r x N matrix whose entries M-ij are independent realizations of 
a random variable R that satisfies: 

1. Var [R\ = 1/r (so that the columns of M. have expected £ 2 norm 1) 

2. E[R] = 0, 

3. For some fixed a > and for all \, 

Prob [\R\ > A] < 2e~ aX2 (3.8) 

Then for a predetermined x G M. N , 

(1 - e)\\x\\^ <\\Mx\\r 2 < (l + e)W lS r (3.9) 
is satisfied with probability exceeding 1—5. 

The constant C bounding r in Lemma 3.4.1 grows with the parameter a specific 
to the construction of M. (3.8). Gaussian and Bernoulli random variables R will 
satisfy the concentration inequality (3.8) for a relatively small parameter a (as can 
be verified directly), and for these matrices one can take C = 8 in Lemma 3.4.1. 

The Johnson Lindenstrauss lemma can be made intuitive with a few observations. 
Since E[i2] = and Var[i?] = the random variable equals in ex- 

pected value; that is, 

E[ \\Mx\\l } = \\x\\l (3.10) 
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Additionally, ||A^x||| inherits from the random variable R a nice concentration in- 
equality: 

Proh[\\Mx\\i - \\x\\l > e\\x\\ 2 2 ] < e - a{2 ^ r? < 5/2. (3.11) 

The first inequality above is at the heart of the JL lemma; its proof can be found 
in [55]. The second inequality follows using that r > (2ae 2 )~ 1 log(f ) and e < 1/2 by 
construction. A bound similar to (3.11) holds for Prob[||A^a;||| — ||#||| < — ^Mll] as 
well, and combining these two bounds gives desired result (3.9). 

For fixed x G M. N , a random matrix Ai constructed according to Lemma 3.4.1 fails to 
satisfy the concentration bound (3.9) with probability at most 5. Applying Boole's 
inequality, M. then fails to satisfy the stated concentration on any of p predetermined 
points {xj} P j = i, Xj G M. N , with probability at most £ = pS. In fact, a specific value 
of £ G (0, 1) may be imposed for fixed p by setting S — £/p. These observations are 
summarized in the following corollary to Lemma 3.4.1. 

Corollary 3.4.2. Fix an accuracy parameter e G (0, 1/2], a confidence parameter 
£ G (0, 1), and fix a set of p points {xj} p =l C M. N . Set 5 = ^/p, and fix an integer 
r > r Q = Ce~ 2 log ^ = Ce~ 2 log If M. is a r x N matrix constructed according to 
Lemma 3.4.1, then with probability > 1 — £, the bound 

(l-e)IK-ll^ < WMxjWtr < (1 + 6)11^ (3.12) 
obtains for each j = 1,2, ...,p. 
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3.5 Cross validation in compressed sensing 

We return to the situation where we would like to approximate a vector x G M. N 
with an assumed sparsity constraint using m < N linear measurements y = Ax 
and m x N matrix A of our choosing. Continuing the discussion in Section 1, we 
will not reconstruct x in the standard way by x = A(A, y, k, 7) for fixed values of 
the input parameters, but instead separate the m x N matrix A into an n x N 
implementation matrix <E> and an r x N cross validation matrix and separate the 
measurements y accordingly into y$ and y^. We use the implementation matrix <3> 
and corresponding measurements y$> as input into the decoding algorithm to obtain a 
sequence of possible estimates (x±, x p ) corresponding to increasing one of the input 
parameters m, k, or 7. We reserve the cross validation matrix \1/ and measurements 
y^ to estimate each of the error terms \\x — Xj\\ 2 in terms of the computable \\y$, — 
\l/5;j||2. We will also estimate the unknown oracle error, 



corresponding to the best possible approximation to x in the metric of l 2 from the 
sequence (xi,...,x p ), using the computable cross validation error, 



Our main result, which follows from Corollary (3.4.2), details how the number of 
cross validation measurements r should be chosen in terms of the desired accuracy e 



= min \\x — xA\,n 
i<j<P 32 



(3.13) 




(3.14) 
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of estimation, confidence level £ in the prediction, and number p of estimates Xj to 
be measured. 

Theorem 3.5.1. For a given accuracy e G (0, 1/2], confidence £ G (0, 1), and num- 
ber p of estimates Xj G M. N , it suffices to allocate r = \Ce~ 2 log ^] rows to a cross 
validation matrix \& of Gaussian or Bernoulli type, normalized according to Lemma 
3.4-2 and independent of the estimates Xj, to obtain with probability greater than or 
equal to 1 — £, and for each j = 1, 2, ...,p, the bounds 



1 \\x — xAliN l 



(3.15) 



and 



l-3e 



< 



\ x ~ XjWqi \\x\hN 



< 



(1 + e)(l - e) 2 " - x^U/WMU " (! - e ) S 



(3.16) 



and a/so 



1 + e 



< ^ < 



(3.17) 



Proof. . 

• The bounds in (3.15) are obtained by application of Lemma 3.4.2 to the p 
points and rearranging the resulting bounds according to Lemma 
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3.3.1 part (1). The bound (3.17) follows from the bounds (3.15) and part (3) 
of Lemma 3.3.1. 

• The bounds in (3.16) are obtained by application of Lemma 3.4.2 to the p + 1 
points u Q = x,Uj = x — Xj, and regrouping the resulting bounds according to 
part (2) of Lemma 3.3.1. 

□ 

Remark 3.5.2. The measurements making up the cross validation matrix \1> must 
be independent of the measurements comprising the rows of the implementation 
matrix $. This comes from the requirement in Lemma 3.4.1 that the matrix ^ be 
independent of the points Uj — Ob OC j • This requirement is crucial, as observed when 
x solves the i\ minimization problem 

x = arg min \\z\\i subject to $z = &x, (3.18) 

zeR N 

in which case the constraint $(£ — x) = clearly precludes the rows of $ from giving 
any information about the error \\x — x\\2- 

Remark 3.5.3. If the full compressed sensing matrix A =[$ ; \P] is itself of Gaussian 
or Bernoulli type, then one can obtain a more accurate approximation to the unknown 
quantities \\x — Xj\\/\\x\\ by using all of the measurements Ax in the estimates — 

%)IIf 2 /IWI^- 

Remark 3.5.4. Proposition 3.5.1 should be applied with a different level of care 

depending on what information about the sequence (x Ob ^ . Ob Ob 2 j ' ' ' 1 *^ ^ 

p ) is sought. 
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If the minimizer x = arg mini<j< p \\^(x — x~j)\\ir is sufficient for one's purposes, 
then the precise normalization of ^ in Proposition 3.5.1 is not important. The 
normalization doesn't matter either for estimating the normalized quantities \\x — 
Xj || 2/ \\x || 2- On the other hand, if one is using cross validation to obtain estimates 
for the quantities ||x — Xj\[2, then normalization is absolutely crucial, and one must 
observe the normalization factor given by Lemma 3.4.2 that depends on the number 
of rows r allocated to the cross validation matrix ^ . 

3.6 Applications 

3.6.1 Estimation of the best /c-term approximation error 

We have already seen that if the m x N matrix $ satisfies 2/c-RIP with parameter 
5 <~ .45, and x = £i($,$x) is returned as the solution to the t\ minimization 
problem (4.3.1), then the error between x and the approximation x is bounded by 

\\x - x\\ 2 < ~^^k(x)iN. (3.19) 

Several other decoding algorithms in addition to t\ minimization enjoy the recon- 
struction guarantee (3.19) under similar bounds on <3>, such as the Iteratively Reweighted 
Least Squares algorithm (IRLS) [32], and the greedy algorithms CoSAMP [57] and 
Subspace Pursuit [27]. It has recently been shown [78] [38] that if the bound (3.19) 
is obtained, and if x — x lies in the null space of $ (as is the case for the decoding 
algorithms just mentioned), then if $ is a Gaussian or a Bernoulli random matrix, 
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the error ||a; — x\\ 2 also satisfies a bound, with high probability on $, with respect to 
the £ 2 residual, namely 

\\x — x\\2 < cak(x)iN , (3.20) 

for a reasonable constant c depending on the RIP constant 5 2 k of $. In the event 
that (3.20) is obtained, a cross validation estimate \\^/(x — x) ||^ can be used to lower 
bound the residual a k {x)q , with high probability, according to 

{l-e)\\^{x-x)\\ lt2 < \\x-x\\ x n <ca k (x)^, (3.21) 

with O(^) rows reserved for the matrix \1> (3.5.1). At this point, we will use Corollary 
3.2 of [58], where it is proved that if the bound (3.19) holds for x with constant c, 
then the same bound will hold for 

Xh = arg min llx — z\\, n 

the best /c-sparse approximation to x, with constant c = 3c. Thus, we may assume 
without loss of generality that x is /c-sparse, in which case (^(x — x)\\ir also provides 
an upper bound on the residual ak(x) t N by 

(l + e)\\^(x-x)\\r 2 > > ^(x) t N. (3.23) 

With almost no effort then, cross validation can be incorporated into many decoding 
algorithms to obtain tight upper and lower bounds on the unknown /c-sparse approx- 
imation error ak(x)iN of x. More generally, the allocation of lOlogp measurements 
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(3.22) 



to the cross validation matrix \& is sufficient to estimate the errors (\\x — x^ || 2)^=1 
or the normalized approximation errors {\\x — ^fcJ^/H^lh)^! at p sparsity levels kj 
by decoding p times, adding rrij measurements to the implementation matrix $j at 
each repetition. Recall that the quantities kj and rrij are related according to (3.1). 

3.6.2 Choice of the number of measurements m 

Photograph-like images have wavelet or curvelet coefficient sequences x e M. N that 
are compressible [37] [16], having entries that obey a power law decay 

M (fc ) < c s k~\ (3.24) 

where x^) denotes the kth largest coefficient of x in absolute value, the parameter 
s > 1 indicates the level of compressibility of the underlying image, and c s is a 
constant that depends only on s and the normalization of x. From the definition 
(3.24), compressible signals are immediately seen to satisfy 

\\x - x k \\i/Vk < c' s k~ s+1/2 , (3.25) 

so that the solution x m = £i($, to the t\ minimization problem (4.3.1) using an 
m x N matrix $ of optimal RIP order k = cm/ \og(N/m)) satisfies 

\\x-x m \\ 2 < c S)5 k- s+l/2 . (3.26) 
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The number of measurements m needed to obtain an estimate x m satisfying \\x — 
Xm\\2 < t for a predetermined threshold r will vary according to the compressibility of 
the image at hand. Armed with a total of m measurements, the following decoding 
method that adaptively chooses the number of measurements for a given signal x 
presents a more democratic alternative to standard compressed sensing decoding 



Table 3.1: CS decoding structure with adaptive number of measurements 

1. Input: The m- dimensional vector y = $x, the m x N matrix <£, (in some algorithms) 
the sparsity level k, and (again, in some algorithms) a bound 7 on the noise level 
of x, the number p of row subsets of <I>, (<J?i, <3?2> •••> 3>p) ; corresponding to increasing 
number of rows mi < 771,2 < ••• < m p < 777, and threshold r > 0. 

2. Initialize the decoding algorithm at j = 1. 

3. Estimate stj = A($ m ., !/ m -, fc,7) with the decoder A at hand, using only the first 
777j measurement rows of The previous estimate can be used for "warm 
initialization" of the algorithm, if applicable. The remaining rj=m — rrij measure- 
ment rows are allocated to a cross validation matrix Hfj that is used to estimate the 
resulting error \\x — Xjl^/H^lh- 

4. Increment j by 1, and iterate from step 3 if stopping rule is not satisfied. 

5. Stop: at index j = j* < p if \\x — x mj \\2/\\x\\2 < t holds with near certainty, as 
indicated by 



according to Proposition 3.5.1. If the maximal number of decoding measurements 
777 p < 777 have been used at iteration p, and (3.27) indicates that \\x — x mp H2/IMI2 > 
r still, return x rn = A($,y, k, 7) using all m measurements, but with a warning 
that the underlying image x is probably too dense, and its reconstruction is not 
trustworthy. 



structure: 




(3.27) 
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3.6.3 Choice of regularization parameter in homotopy-type 
algorithms 

Certain compressed sensing decoding algorithms iterate through a sequence of in- 
termediate estimates Xj that could be potential optimal solutions to x under cer- 
tain reconstruction parameter choices. This is the case for greedy and homotopy- 
continuation based algorithms. In this section, we study the application of cross vali- 
dation to the intermediate estimates of decoding algorithms of homotopy-continuation 
type. 

LASSO is the name coined in [71] for the problem of minimizing of the following 
convex program: 

xM = arg min \\&x - $z\\e 2 (3.28) 

||z||l<T 

The two terms in the LASSO optimization problem (3.28) enforce data fidelity and 
sparsity, respectively, as balanced by the regularization parameter r. In general, 
choosing an appropriate value for r in (3.28) is a hard problem; when <3> is an underde- 
termined matrix, as is the case in compressed sensing, the function f(r) = \\x — x^\\ 2 
is unknown to the user but is seen empirically to have a minimum at a value of r in 
the interval [0, ||$x||oo] that depends on the unknown noise level and/or and com- 
pressibility level of x. 

The homotopy continuation algorithm [53], which can be viewed as the appropri- 
ate variant of LARS [53], is one of many algorithms for solving the LASSO problem 
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(3.28) at a predetermined value of r; it proceeds by first initializing r' to a value 
sufficiently large to ensure that the l\ penalization term in (3.28) completely domi- 
nates the minimization problem and = trivially. The homotopy continuation 
algorithm goes on to generate for decreasing r' until the desired level for r is 
reached. If r = 0, then the homotopy method traces through the entire solution 
path x^ E R N for r' > before reaching the final algorithm output x^ = Ci(<&,y) 
corresponding to the i\ minimizer (4.3.1). 

From the non-smooth optimality conditions for the convex functional (3.28), it can 
be shown that the solution path x' T l e M. N is a piecewise-affine function of r [53], 
with "kinks" possible only at a finite number of points r G {r 1 ,r 2 ,...}. Proposi- 
tion 3.5.1 suggests a method whereby an appropriate value of r* can be chosen 
from among a subsequence of the kinks (ti, r 2 , r p ) by solving the minimization 
problem x' T *l = argmm,-< p \\^(x — x^)\\ 2 for appropriate cross validation matrix 
^. Moreover, since the solution x — x^ for tj < t < t,-+i is restricted to lie in the 
two-dimensional subspace spanned by x — x^ and x — x^ Tj+1 \ one can combine the 
Johnson Lindenstrauss Lemma with a covering argument analogous to that used to 
derive the RIP property for Gaussian and Bernoulli random matrices in [6], to cross 
validate the entire continuum of solutions x^ between T\ < r < r p . More precisely, 
the following bound holds under the conditions outlined in Theorem (3.5.1), with 
the exception that 2r (as opposed to r) measurements are reserved to 

1 ^ min Tl < T < Tp \\x - x [t % 1 
1 + e ~ min n < T < Tp \\V(x - xW)\\ 2 ~ 1 - e 
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(3.29) 



Unfortunately, the bound (3.29) is not strong enough to provably evaluate the 
entire solution path x^ for r > 0, because the best upper bound on the number 
of kinks on a generic LASSO solution path can be very large. One can prove that 
this number is bounded by 3^, by observing that if x^ and x 1 ^ have the same sign 
pattern, then x^ also has the same sign pattern for T\ < t < r 2 . Applying Proposi- 
tion 3.5.1 to p = 3 N points x — Xj, this suggests that O(N) rows would need to be 
allocated to a cross validation matrix \I> in order for Proposition 3.5.1 and the corol- 
lary (3.29) to apply to the entire solution path, which clearly defeats the compressed 
sensing purpose. However, whenever the matrix $ is an m x N compressed sensing 
matrix of random Gaussian, Bernoulli, or partial Fourier construction, it is observed 
empirically that the number of kinks along a homotopy solution path behaves like 
0(m) for generic vectors x e M. N used to generate the path, see Figure 3.1. This 
suggests, at least heuristically, that the allocation of O(logm) out of m compressed 
sensing measurements of this type suffices to ensure that the error \\x — x 1 ^ || 2 for the 
solution x^ = argmin T > ||\l/(x — x^) || 2 will be within a small multiplicative factor of 
the best possible error in the metric of obtainable by any approximant x^ along 
the solution curve r > 0. At the value of r corresponding to x^ T \ the LASSO solution 
(3.28) can be computed using all m measurements $ as a final approximation to x. 

The Dantzig selector (DS) [19] refers to a minimization problem that is similar 
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Figure 3.1: Number of steps in the homotopy continuation algorithm, as described in 
Section 3.6.3, as a function of the number of rows m comprising the encoding matrix 
$. The thick solid, dashed, and dotted lines represent a bound satisfied by 95 out of 
100 trials corresponding to independent realizations of an m x N Gaussian random 
matrix $ and iV-dimensional Gaussian random vector x, when N = 500, 1000, and 
2000, respectively. The lines are generated from data at m = 125, 250, and 500. The 
number of steps appears to depend sublinearly on the number of rows m and does 
not seem to depend on the ambient dimension N. The thin solid line corresponds to 
the observed upper bound of 2m. 
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in form to the LASSO problem: 

x T — arg min \\Qx — &z\\e + r\\z\\i (3.30) 

zeR N 

The difference between the DS (3.30) and LASSO (3.28) is the choice of norm (1^ 
versus £ 2 ) on the fidelity-promoting term. Homotopy-continuation based algorithms 
have also been developed to solve the minimization problem (3.30) by tracing through 
the solution path x T > for r' > r. As the minimization problem (3.30) can be refor- 
mulated as a linear program, its solution path x T e M. N is seen to be a piecewise 
constant function of r, in contrast to the LASSO solution path. In practice, the 
total number of breakpoints (ti, t 2 , ...) in the domain < r is observed to be on the 
same order of magnitude as m when the mx N matrix $ satisfies RIP [50] ; thus, the 
procedure just described to cross validate the LASSO solution path can be adapted 
to cross validate the solution path of (3.30) as well. 

Thus far we have not discussed the possibility of using cross validation as a stopping 
criterion for homotopy-type decoding algorithms. Along the LARS homotopy curve 
(3.28), most of the breakpoints (ti, r 2 , ...) appear only near the end of the curve in a 
very small neighborhood of r = 0. These breakpoints incur only miniscule changes in 
the error \\x — x Tj \\2 even though they account for most of the computational expense 
of the LARS decoding algorithm. Therefore, it would be interesting to adapt such 
algorithms, perhaps using cross validation, to stop once r* is reached for which the 
error \\x — x T * \\ 2 is sensed to be sufficiently small. 
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3.6.4 Choice of sparsity parameter in greedy-type algorithms 

Greedy compressed sensing decoding algorithms also iterate through a sequence of 
intermediate estimates Xj that could be potential optimal solutions to x under certain 
reconstruction parameter choices. Orthogonal Matching Pursuit (OMP), which can 
be viewed as the prototypical greedy algorithm in compressed sensing, picks columns 
from the implementation matrix $ one at a time in a greedy fashion until, after k 
iterations, the /c-sparse vector a linear combination of the k columns of $ chosen 
in the successive iteration steps, is returned as an approximation to x. The OMP 
algorithm is listed in Table 2. Although we will not describe the algorithm in full 
detail, a comprehensive study of OMP can be found in [72] . 
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Table 3.2: Orthogonal Matching Pursuit Basic Structure 



1. Input: The m-dimensional vector y = Bx, the m x N encoding matrix <1> whose j 
column is labeled <pj, and the sparsity bound k. 

2. Initialize the decoding algorithm at j = 1, the residual ro = y, and the index set 
A = 0- 

3. Estimate 

(a) Find an index Aj that realizes the bound ($ rj-i)\ j = ||<& rj_i||oo. 

(b) Update the index set Aj = Aj_iUAj and the submatrix of contributing columns: 
*j = <I> / i- <^] 

(c) Update the residual: 

Sj = argmin \\$jX - y\\ 2 = ($j® j y 1 $jy, 

aj = <S>jXj 

r i = r i-i-«i- 

(d) The estimate Xj for the signal has nonzero indices at the components listed in 
Aj , and the value of the estimate Xj in component Aj equals the ith component 
of Sj. 

4. Increment j by 1 and iterate from step 3, if j < k. 

5. Stop: at j = k. Output x omp = Xk as approximation to x. 



Note in particular that OMP requires as input a parameter k corresponding to the 
expected sparsity level for x G M. N . Such input is typical among greedy algorithms 
in compressive sensing (in particular, we refer the reader to [57], [32], and [27]). As 
shown in [72], OMP will recover with high probability a vector x having at most 
k < m/\ogN nonzero coordinates from its image §x if $ is a (known) m x N 
Gaussian or Bernoulli matrix with high probability. Over the more general class 
of vectors x = Xd + Af that can be decomposed into a ci-sparse vector Xd (with d 
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presumably less than or equal to k) and additive noise vector Af, we might expect 
an intermediate estimate x s to be a better estimate to x than the final OMP output 
Xk, at least when d « k. Assuming that the signal x admits a decomposition of 
the form x = x^ + Af, the sequence of intermediate estimates (x±, x% ) of an OMP 
algorithm can be cross validated in order to estimate the noise level and recover 
a better approximation to x. We will study this particular application of cross 
validation in more detail below. 

3.7 Orthogonal matching pursuit: a case study 

As detailed in Table 2, a single index Aj is added to a set Aj estimated as the j most 
significant coefficients of x at each iteration j of OMP; following the selection of Aj, 
an estimate Xj to x is determined by the least squares solution, 

Xj = arg min ||$z — y|| 2 , (3.31) 

supp(z)eAj 

among the subspace of vectors z G WL N having nonzero coordinates in the index set 
Aj. OMP continues as such, adding a single index Aj to the set Aj at iteration j, 
until j — k at which point the algorithm terminates and returns the A;-sparse vector 
Xomp = Xk as approximation to x. 

Suppose x has only d significant coordinates. If d could be specified beforehand, 
then the estimate Xd at iteration j — d of OMP would be returned as an approxi- 
mation to x. However, the sparsity d is not known in advance, and k will instead be 
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an upper bound on d. As the estimate Xj in OMP can be then identified with the 
hypothesis that x has j significant coordinates, the application of cross-validation as 
described in the previous section applies in a very natural way to OMP. In particular, 
we expect x or and x cv of Proposition 3.5.1 to be close to the estimate Xj at index 
j = \x\ corresponding to the true sparsity of x; furthermore, in the case that \x\ is 
significantly less than k, we expect the cross validation estimate x cv to be a better 
approximation to x than the OMP- returned estimate xt- We will put this intuition 
to the test in the following numerical experiment. 

3.7.1 Experimental setup 

We initialize a signal xo of length iV = 3600 and sparsity level d = 100 as 

{1, for j = 1... 100 
(3.32) 
0, else. 

Noise is then added to x a = x + M a in the form of a Gaussian random variable Af a 
distributed according to 

N a ~ 7V(0,.05), (3.33) 

and the resulting vector x a is renormalized to satisfy ||^a||;^ = 1- This yields an 
expected noise level of 

E[a d (x a )] ^.284. (3.34) 
We fix the input k = 200 in Table 2, and assume we have a total number of com- 
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pressed sensing measurements m = 800. A number r of these m measurements 
are allotted to cross validation, while the remaining n — m — r measurements are 
allocated as input to the OMP algorithm in Table 2. This experiment aims to nu- 
merically verify Proposition 3.5.1; to this end, we specify a confidence £ = 1/100, 
and solve for the accuracy e according to the relation r = e~ 2 log(£); that is, 



Note that the specification (3.35) corresponds to setting the constant C = 1 in Propo- 
sition 3.5.1. Although C > 8 is needed for the proof of the Johnson Lindenstrauss 
lemma at present, we find that in practice C — 1 already upper bounds the optimal 
constant needed for Proposition 3.5.1 for Gaussian and Bernoulli random ensembles. 

A single (properly normalized) Gaussian n x N measurement matrix $ is generated 
(recall that n — m-r) , and this matrix and the measurements y = $x are provided 
as input to the OMP algorithm; the resulting sequence of estimates (xi,x 2 , •••,£&) is 
stored. The final estimate Xk from this sequence is the returned OMP estimate x omp 
to x. The error i] omp = \\x omp — x\\2 is greater than or equal to the oracle error of the 
sequence, r] or = mm £j \\x — Xj\\ 2 - 

With the sequence (xi, x 2 , Xk) at hand, we consider 1000 realizations of an 
r x N cross validation matrix having the same componentwise distribution as $, but 
normalized to have variance 1/r according to Theorem 3.4.1. The cross validation 




(3.35) 
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error 




(3.36) 



is measured at each realization \I/ 9 ; we plot the average fj^ v of these 1000 values and 
intervals centered at fj^ v having length equal to twice the empirical standard devi- 
ation. Note that we are effectively testing 1000 trials of OMP-CV, the algorithm 
which modifies OMP to incorporate cross validation so that {x cv ,r]^ v ) are output in- 

SijGctci of $omp ~~ *&k m 

At the specified value of £, Proposition 3.5.1 part (3.16) (with constant C — 1) 
implies that 



should obtain on at least 990 of the 1000 estimates rj cv (q); in other words, at least 
990 of the 1000 discrepancies \i] or — ffa,(q)\ should be bounded by 



Using the relation (3.35) between e and r, this bound becomes tighter as the number 
r of CV measurements increases; however, at the same time, the oracle error r\ or 
increases with r for fixed m as fewer measurements n — m — r are input to OMP. An 
ideal number r of CV measurements should not be too large or too small; Figure 1 
suggests that setting aside just enough measurements r such that e < .6 is satisfied 
in (3.35) serves as a good heuristic to choose the number of cross validation mea- 
surements (in Figure 1, e < .6 is satisfied by taking only r = 30 measurements). 



(l - e)r] or < r] cv (q) < (l + e)rj, 



'or 



(3.37) 



< \r]cv(q) ~ Vor\ < €71, 



(3.38) 
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We indicate the theoretical bound (3.37) with dark gray in Figure 1, which is com- 
pared to the interval in light gray of the 990 values of r] cv (q) that are closest to r\ or 
in actuality. 

This experiment is run for several values of r within the interval [5,90]; the re- 
sults are plotted in Figure 1(a), with the particular range r G [5,30] blown up in 
Figure 1(b). 

We have also carried out this experiment with a smaller noise variance; i.e. Xb = 
x Q + Mb is subject to additive noise 



The signal x\, is again renormalized to satisfy ||^b||^ = 1; it now has an expected 
noise level of 



M b ~ iV(0,.02). 



(3.39) 



E[<j d (x b )] « .116. 



(3.40) 



The results of this experiment are plotted in Figure 1(c). 
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Figure 3.2: Comparison of the reconstruction algorithms OMP and OMP-CV. We fix 
the parameters N = 3600, m = 800, k = 200, and underlying sparsity d = 100, but vary 
the number r of the total m measurements reserved for cross validation from 5 to 90, 
using the remaining n = m — r measurements for training. The underlying signal has 
residual cioo(^) ~ -284 in Figures 1(a) and 1(b), and aioo(x) « .116 in Figure 1(c), as 
shown for reference by the thin horizontal line. In both cases, the OMP-CV error r\ cv (the 
solid black line with error bars; each point represents the average of 1000 trials) gives a 
better approximation to the residual error than does OMP (dot-dashed line) with very 
high probability, even when as few as 20 of the total 800 measurements are used for cross 
validation. Even though r/ cv is guaranteed to provide a tighter bound for rj or as the number 
r of CV measurements increases, at the same time, the oracle error rj or becomes a worse 
indicator of the residual a"ioo(^) because fewer measurements n = m — r are input to OMP. 
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3.7.2 Experimental results 

1. We remind the reader that the cross-validation estimates rj^ v are observable to 
the user, while the values of r] omp , i] or , along with the noise level Cd(x), are not 
available to the user. Nevertheless, rj2, can serve as a proxy for r\ or according 
to (3.37), and this is verified by the plots in Figure 1. r\^ v can also provide an 
upper bound on <Jd(x), as is detailed in Section 5.1. 

2. The theoretical bound (3.37) is seen to be tight, when compared with the 
observed concentration bounds in Figure 1. 

3. With high probability, the estimate x cv (15) using r = 15 out of the alloted 
m = 800 measurements will be a better estimate of x than the OMP estimate: 
||a; c „(15) — x\\2 < ||x omp (15) — x||2. With overwhelming probability, the estimate 
x c „(30) will result in error ||x CT (30) — x\\ 2 < ||aj omp (30) — x|| 2 . We note that the 
estimates x cv {lh) and x cv (30) correspond to accuracy parameters e(15) = .8405 
and e(30) = .5943 in (3.35), indicating that e < .6 is a good heuristic to 
determine when enough CV measurements have been reserved. 

4. The OMP-CV estimate x cv will have more pronounced improvement over the 
OMP estimate x omp when there is larger discrepancy between the true sparsity d 
of x and the upper bound k used by OMP (in Figure (1), d — 100 and k = 200). 
In contrast, OMP-CV will not outperform OMP in approximation accuracy 
when d is close to k; however, the multiplicative relation (3.37) guarantees 
that OMP-CV will not underperform OMP, either. 
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3.8 Beyond compressed sensing 

The Compressed Sensing setup can be viewed within the more general class of under- 
determined linear inverse problems, in which x G WL N is to be reconstructed from a 
known mx N underdetermined matrix A and lower dimensional vector y = Ax using 
a decoding algorithm A : IR m — > M. N ; in this broader context, A is given to the user, 
but not necessarily specified by the user as in compressed sensing. In many cases, a 
prior assumption of sparsity is imposed on x, and an iterative decoding algorithm 
for solving the LASSO problem (3.28) will be used to reconstruct x from y [30]. If 
it is possible to take on the order of r = logp additional measurements of x by an 
r x N matrix \1> satisfying the conditions of Lemma 3.4.1, then all of the analysis 
presented in this paper applies to this more general setting. In particular, the error 
||x — Xjl^N at up to j < p successive approximations Xj of the decoding algorithm 
A may be bounded from below and above using the quantities \\^(x — Xj)\\gr, and 
the final approximation x to x can be chosen from among the entire sequence of esti- 
mates £j as outlined in Proposition 3.5.1; an earlier estimate Xj may approximate x 
better than a final estimate x p which contains the artifacts of parameter overfitting 
occurring at later stages of iteration. 

3.9 Extensions 

We have presented an alternative approach to compressed sensing in which a certain 
number r of the m allowed measurements of a signal x G M. N are reserved to track the 
error in decoding by the remaining m — r measurements, allowing us to choose a best 
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approximation to x in the metric of out of a sequence of p estimates (xj)^ =1 , and 
estimate the error between x and its best approximation by a /c-sparse vector, again 
with respect to the metric of ■ We detailed how the number r of such measurements 
should be chosen in terms of desired accuracy e of estimation, confidence level £ in 
the prediction, and number p of decoding iterations to be measured; in general, 
r = O(log(jo)) measurements suffice. Several important issues remain unresolved; we 
mention only a few below. 

1. Noisy measurements. Consider the noisy measurement model, 

y = Ax + Af, (3.41) 

where x G R N , y G M M , and J\f ~ N(0, a 2 ) is a Gaussian random variable 
that accounts for both noise and quantization error on the measurements Ax. 
Because measurement noise and quantization error are unavoidable in any real- 
world sensing device, any proposed compressed sensing technique should extend 
to the model (3.41). Cross validation is studied in [11] in the context of noisy 
measurements, assuming that x is truly sparse and that Af ~ N(0, a 2 ) is Gaus- 
sian noise. The experimental results in [11] indicate that cross validation works 
well in the presence of noise, and Proposition 3.5.1 provides intuition as to why. 
Suppose that x G M, N is /c-sparse, so that the best possible reconstruction to x 
using n noisy measurements y\ = $x + Mi of the m total noisy measurements 



125 



y = Ax + Af is bounded by the £ 2 norm of M\. 

\\ x ~ x \\i» ~ E[||M||^] = &Vn- (3.42) 

Alternatively, the noise vector M cv corresponding to the remaining r = m — 
n cross validation measurements has exponentially smaller expected £ 2 norm 
E[||A/" cl ,||^] = ay/10 logn, assuming that r = 10 log n. It follows by application 
of the triangle inequality 

(1 - e)\\x - x\\gr - oVlogn < \\^(x - x) + M cv \\r 2 

< (1 + e)\\x - x\\e 2 + ay/logn, 

whence the approximation error \\x — ~ o~y/n dominates both lower and 
upper bounds on the noisy cross validation error. 

2. Other cross validation techniques. The cross validation technique pro- 
moted in this paper corresponds in particular to the technique of holdout cross 
validation in statistics, where a data set is partitioned into a single training 
and cross validation set (as a rule of thumb, the cross validation set is usually 
taken to be less than or equal to a third of the size of the training set; in the the 
current paper, we have shown that the Johnson Lindenstrauss lemma provides 
a theoretical justification of how many, or, more precisely, how few, cross vali- 
dation measurements are needed in the context of compressed sensing). Other 
forms of cross validation, such as repeated random subsampling cross validation 
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or K-fold cross validation, remain to be analyzed in the context of compressed 
sensing. The former technique corresponds to repeated application of holdout 
cross validation, with r cross validation measurements out of the total m mea- 
surements chosen by random selection at each application. The results are then 
averaged (or otherwise combined) to produce a single estimation. The latter 
technique, K-io\d cross validation, also corresponds to repeated application of 
holdout cross validation. In this case, the m measurements are partitioned into 
K subsets of equal size r, and cross-validation is repeated exactly K times with 
each of the K subsets of measurements used exactly once as the validation set. 
The K results are again combined to produce a single estimation. Although 
Proposition 3.5.1 does not directly apply to these cross validation models, the 
experimental results of Section 6 suggest that, equiped with an m x N matrix 
satisfying the requirements of Lemma 3.4.1, the application of K fold cross 
validation to subsets of the measurements of size r << m — r just large enough 
that e > in Proposition 3.5.1 for fixed accuracy £ and constant C — 1 can be 
combined to accurately approximate the underlying signal with near certainty. 

3. Cross validation in different reconstruction metrics. We have only 
considered cross validation over the metric of £ 2 - However, the error \\x — 
x\\ e N, or root mean squared error, is just one of several metrics used in image 
processing for analyzing the quality of a reconstruction x G M. N to a (known) 
image x G WL N . In fact, the £i reconstruction error \\x — x\\ £ n has been argued 
to outperform the root mean squared error as an indicator of reconstruction 
quality [37]. Unfortunately, Proposition 3.5.1 cannot be extended to the metric 



127 



of £i, as there exists no l\ analog of the Johnson Lindenstrauss Lemma [25]. 
However, it remains to understand the extent to which cross validation in 
compressed sensing can be applied over a broader class of image reconstruction 
metrics, perhaps using more refined techniques than those considered in this 
paper. 

4. Cross validation with more general CS matrix ensembles. Many more 
compressed sensing matrices in addition to those satisfying the requirements 
of Lemma 3.4.1 can be used for cross validation purposes, in the sense that 
a concentration bound similar to (3.4.1) will hold for most input x. Indeed, 
it has been recently shown [39] that such random partial Fourier matrices, if 
multiplied on the right by an N x N diagonal matrix of independent and iden- 
tically distributed Bernoulli random variables, will satisfy the concentration 
bound (3.4.1) with slightly larger number of measurements r > e~ 2 log 3 ^. 
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Chapter 4 

Free discontinuity problems meet 
iterative thresholding 

4.1 Introduction 

Free-discontinuity problems describe situations where the solution of interest is de- 
fined by a function and a lower dimensional set consisting of the discontinuities of 
the function. Hence, the derivative of the solution is assumed to be a 'small' function 
almost everywhere except on sets where it concentrates as a singular measure. This 
is the case, for instance, in crack detection from fracture mechanics or in certain 
digital image segmentation problems. If we discretize such situations for numerical 
purposes, the free-discontinuity problem in the discrete setting can be re-formulated 
as that of finding a derivative vector with small components at all but a few en- 
tries that exceed a certain threshold. This problem is similar to those encountered 
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in Compressed sensing, where vectors with a small number of dominating compo- 
nents in absolute value are recovered from a few given linear measurements via the 
minimization of related energy functionals. Several iterative thresholding algorithms 
that intertwine gradient-type iterations with thresholding steps have been designed 
to recover sparse solutions in this setting. It is natural to wonder if and/or how such 
algorithms can be used towards solving discrete free-discontinuity problems. The 
current chapter explores this connection, and, by establishing an iterative thresh- 
olding algorithm for discrete free-discontinuity problems, provides new insights on 
properties of minimizing solutions thereof. 

4.1.1 Free-discontinuity problems: the Mumford-Shah func- 
tional 

The terminology 'free-discontinuity problem' was introduced by De Giorgi [36] to in- 
dicate a class of variational problems that consist in the minimization of a functional, 
involving both volume and surface energies, depending on a closed set K C M. d , and 
a function u on W 1 usually smooth outside of K. In particular, 

• K is not fixed a priori and is an unknown of the problem; 

• K is not a boundary in general, but a free-surface inside the domain of the 
problem. 



130 



The best-known example of a free-discontinuity problem is the one modelled by the 
so-called Mumford-Shah functional [56] , which is defined by 



The set is a bounded open subset of R d , a, (5 > are fixed constants, and 
g G L°°(Q). Here H N denotes the iV-dimensional Hausdorff measure. Through- 
out this paper, the dimension of the underlying Euclidean space IR d will always be 
d — 1 or d — 2. In the context of visual analysis, g is a given noisy image that 
we want to approximate by the minimizing function u G W 1,2 (Q \ K); the set K is 
simultaneously used in order to segment the image into connected components. For 
a broad overview on free-discontinuity problems, their analysis, and applications, we 
refer the reader to [3] . 

If the set K were fixed, then the minimization of J with respect to u would be a 
relatively simple problem, equivalent to solving the following system of equations: 



where v is the outward-pointing normal vector at any x G dQ U K. Therefore the 
relevant unknown in free-discontinuity problems is the set K. Ensuring the existence 
of minimizers (u, K) of J is a challenging problem because there is no topology on 
the closed sets that ensures 




Au 



a(u-g), 



mQ\K, 



du 
dv 



on dtt U K, 
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(a) compactness of minimizing sequences and 

(b) lower semicontinuity of the Hausdorff measure. 

Indeed, it is well-known, by the direct method of calculus of variations [28, Chap- 
ter 1], that the two previous conditions ensure the existence of minimizers. How- 
ever, the problem becomes more manageable if we restrict our domain to functions 
u G BV(Q) n W 1,2 (Vt \ K), and make the identification K = S u where S u is the well- 
defined discontinuity set of u. In this case, we need to work only with a topology on 
the space BV(Q) of bounded variation, and no set topology is anymore required. 

Unfortunately the space BV(Q) is 'too large'; it contains Cantor-like functions whose 
approximate gradient vanishes, Vw = 0, almost everywhere, and whose discontinuity 
set has measure zero, l-L d ~ 1 (S u ) = 0. As these functions are dense in L 2 (il), the 
problem is trivialized; see [3] for details. 

Nevertheless, it is possible to give a meaningful formulation of the functional J 
if we exclude such functions and restrict J to the space SBV(Q) constituted of BV- 
functions with vanishing Cantor part. If we assume again K = S u , the solution can 
be recast as the minimization of 



Jn\s u 

The existence of minimizers in SBV for the functional (4.1) was established by 
Ambrosio on the basis of his fundamental compactness theorem in [2], see also [3, 





(4.1) 
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Theorem 4.7 and Theorem 4.8]. 

4.1.2 T-convergence approximation to free-discontinuity prob- 
lems 

The discontinuity set S u of a SBV-f unction u is not an object that can be easily han- 
dled, especially numerically. This difficulty gave rise to the development of approxi- 
mation methods for the Mumford-Shah functional and its minimizers where sets are 
no longer involved, and instead substituted by suitable indicator functions. In order 
to understand the theoretical basis for these approximations, we need to introduce 
the notion of T-convergence, which is today considered one of the most successful 
notions of 'variational convergence'; we state only the definition of T-convergence 
below, but refer the reader to [28, 13] for a broad introduction. 

Definition 4.1.1. Let (X,d) be a metric space 1 and let /, f n : X — > [0,oo] be 
functions for n e N. We say that (f n ) ne ^ Y -converges to f if the following two 
conditions are satisfied: 

i) for any sequence (x n ) n C X converging to x, 

liminf f n (x n ) > f(x); 

n 

1 Observe that by [28, Proposition 8.7] suitable bounded sets X endowed with the weak topology 
induced by a larger Banach space are indeed metrizable, so this condition is not that restrictive. 
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ii) for any x e X, there exists a sequence (x n ) n C X converging to x such that 

limsup/ n (x n ) < f(x). 

n 

One important consequence of Definition 4.1.1 is that if a sequence of functionals 
/„ T-converges to a target functional /, then the corresponding minimizers of f n also 
converge to minimizers of /, see [28, Corollary 7.30]. 

We define now 

F £ (u,v):=J [v 2 \Vu\ 2 + a(u-g) 2 ]+^e\Vv\ 2 + ^—^Jdx (4.2) 
over the domain (L 2 (Q)) 2 , along with the related functional 



FJu,v) , if v e W 1 ' 2 ^), uv G W^ 2 (Q), and < v < 1, 
J £ (u,v) := { (4.3) 

oo , else. 



Note that at the minimizer (u, v) of J~ e , the function < v < 1 tends to indicate the 
discontinuity set S u of the functional (4.1) as e — > 0. In [4] Ambrosio and Tortorelli 
proved the following T-approximation result: 

Theorem 4.1.2 (Ambrosio- Tortorelli '90). For any infinitesimal sequence (e n ) n , the 
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functional J £n {u,v) T-converges in (L 2 (f2)) 2 to the functional 



J(u,v) : 



J(u) , ifv = l, 



(4.4) 



oo , otherwise. 



4.1.3 Discrete approximation 

In fact, the Mumford-Shah functional is the continuous version of a previous discrete 
formulation of the image segmentation problem proposed by Geman and Geman in 
[47]; see also the work of Blake and Zisserman in [9]. Let us recall this discrete 
approach. For simplicity let d — 2 (as for image processing problems), VL = [0, l] 2 , 
and let Uij = u(hi, hj), G Z 2 be a discrete function defined on Qh '■= & H hi?, 
for h > 0. Define Wh(t) = min{t 2 , /3/h} to be the truncated quadratic potential, and 



(hi,hj)en h 

Chambolle [20, 21] gave formal clarification as to how the discrete functional J 



approximates the continuous functional J of Ambrosio: discrete sequences can be 
interpolated by piecewise linear functions in such a way as to allow for discontinuities 
when the discrete finite differences of the sampling values are large enough. On the 
basis of this identification of discrete functions on Q h and functions defined on the 
'continuous domain' Q, we have the following result: 






(4.5) 
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Theorem 4.1.3 (Chambolle '95). The functional J 



Y -converges in B(Q) (the 



space of Borel-measurable functions, which is metrizable, see [21] for details) to 



as h — >■ 0, where C is the so-called 'cab-driver' measure defined below. 

Basically C measures the length of a curve only through its projections along 
horizontal and vertical axes; for a regular C l curve c = 7([0, 1]), with ^(t) = 
(li(t),l2(t)) e ^, we have 



The reason this anisotropic (or, direction dependent) measure appears, in place of the 
Hausdorff measure in the Mumford-Shah functional, is due to the approximation of 
derivatives by finite differences defined on a 'rigid' squared geometry A discretization 
of derivatives based on meshes adapted to the morphology of the discontinuity indeed 
leads to precise approximations of the Mumford-Shah functional [22, 12]. 

4.1.4 Free-discontinuity problems and discrete derivatives 

In the literature, several methods have been proposed to numerically approximate 
minimizers of the Mumford-Shah functional [8, 12, 20, 21, 54]. In particular, a re- 
laxation algorithm, based essentially on alternated minimization of a finite element 
approximation of the Ambrosio and Tortorelli functional (4.3), leads to iterated so- 
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lutions of suitable elliptic PDEs, where the differential part includes the auxiliary 
variable v which encodes and indicates information about the discontinuity set. These 
implementations are basically finite dimensional approximations to the following al- 
gorithm: Starting with = 1, iterate 



However, neither has a proof of convergence of this iterative process to its stationary 
points been explicitly provided in the literature, nor have the properties of such 
stationary points been investigated, especially in case of genuine inverse problems 
(see the discussion in Subsection 4.1.4). 

In this paper, we take a different approach and investigate how minimization of 
the T-approximating discrete functional (4.5) can be implemented efficiently by iter- 
ative thresholding on the discrete derivatives. Unlike the aforementioned approach, 
we will be able to provide a rigorous proof of convergence to stationary points, which 
coincide with local minimizers of the discrete Mumford-Shah functional. Moreover, 
we are able to characterize stability properties of such stationary points, and demon- 
strate the stability of global minimizers of the discrete Mumford Shah functional. 

Let us recall: the solutions u of a free-discontinuity problem are supposed to 
be smooth out of a minimal ipersurface K. This means that the distributional 
derivative of u is a 'small function' everywhere except on K where it coincides with a 
singular measure. In the discrete approximation (4.5), the vector of finite differences 
(wj) = ( Ui 'j +1 ~ Ul ' j ; u i+ij- u iJ ) corresponds to a piecewise constant function that is 
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small everywhere except for a few locations, corresponding to \wj\ > y//3/h, that 
approximate the discontinuity set K. So, in terms of derivatives, solutions of (4.5) 
are vectors having only few large entries. In the next section, we clarify how we can 
indeed work with just derivatives and forget the primal problem. 

The 1-D case 

Let us assume for simplicity that the dimension d — 1, the domain Q = [0, 1], and 
the parameters a — j3 — 1. Denote by Ui = u(hi) a discrete function defined on 
hi E Qfi '■= ^ H hZ, for h > 0; note that the vector («;) G W 1 for n — [l/h\. In this 
setting, the discrete functional (4.5) reduces to 

+ h ^ (ui- gi) 2 , 
(hi)en h 

where we recall that Whit) = min{t 2 , l/h}. Since no geometrical anisotropy is now 
involved (d — 1), it is possible to show that this discrete functional T-converges pre- 
cisely to the corresponding Mumford-Shah functional on intervals [20]. 

For (ui) hi< =Q h we define the discrete derivative as the matrix D h : IR n — > W 1 ^ 1 that 
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maps (ui) hi£nh into ( Ur+ \ "' )., given by 



h 



I -1 1 

0-11 ... 

-1 1 



(4.6) 



It is not too difficult to show that 



u = D\D h u + c, 

where D\ is the pseudo-inverse matrix of D h (in the Moore-Penrose sense; note that 
Dl maps M n_1 into W l and is an injective operator) and c is a constant vector which 
depends on u, and the values of its entries coincide with the mean value h Y2hien h M * 
of u. Therefore, any vector u is uniquely identified by the pair (D h u,c). 



Since constant vectors comprise the null space of D^, the orthogonality relation 
{D^DhU, c)q = holds for any vector u and any constant vector c. Here the scalar 
product (•, -)gn = ^^iVi is the standard Euclidean scalar product, which induces 
the Euclidean norm := {J2i u 'i) 1 ^ 2 - Using this orthogonality property, we have 

that 



u ~9\\q = \\ D l D hU- D[D h g + (c - c g )\\ 2 e n 

= \\DlD h u-DlD h g\\ 2 q + \\c-c g \\ 2 e n 
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Hence, with a slight abuse of notation, we can reformulate the original problem in 
terms of derivatives, and mean values, by 

<?i/Vh( z , c ) = h \\ D h z + h W c ~ c g\\% + h Yl min \ \ Zi 



where z = D h u and / = D\D h g. Of course at the minimizer u we have c = c g , since 
this term in J^j^ does not depend on z. Therefore, ||c — c g \^ does not play any role 
in the minimization and can be neglected. Once the minimal derivative vector z is 
computed, we can assemble the minimal u by incorporating the mean value of g as 
follows: 

U = D{Z + Cg. 

The 2-D case, discrete Schwartz conditions, and constrained optimization 

Let us assume now d = 2, Q = [0, l] 2 , and again a = (3 = 1. Denote u i; j = u(hi, hj), 
G Z 2 , a discrete function defined on £l h := f2 fl hZ 2 , n = [l/h\ , and 

(hi,hj)£Qh 

+ h 2 Yl w * 

(hi,hj)£Q h 
(hi,hj)&n h 



h 



u i,j+l u i,j 
h 
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In two dimensions, we have to consider the derivative matrix : IR" 2 — > ]R 2n ( n_1 ) 
that maps the vector (uj + ^-i^ n ) := (uij) to the vector composed of the finite differ- 
ences in the horizontal and vertical directions u x and u y respectively, given by 



D h u := 



u x 




Uy 





( u x)j+n(i-l) '■— ( u x)i,j '■ — 



u i + l,j u i,j ■ _ 

h ' ' - 



= l,...,n- l,j = l,...,n 
,i = l,...,n,j = l,...,n- 1 



Note that its range R(D h ) C R 2 "(" _1 ) is a (n 2 — 1) -dimensional subspace because 
DhC = for constant vectors c G IR™ 2 . Again, we have the differentiation-integration 
formula, given by 

u = D\p h u + c, 

where D\ is the pseudo-inverse matrix of D h (in the Moore-Penrose sense); note 
that D^ h maps R{Dh) injectively into IR™ 2 . Also, c is a constant vector that depends 
on u, and the values of its entries coincide with the mean value h 2 Yl(hi hj)€n h u h3 °f u - 



Proceeding as before and again with a slight abuse of notation, we can reformu- 
late the original discrete functional (4.5) in terms of derivatives, and mean values, 
by 

J 1/VK (z,c) = ^[||4 z _ / ||^ 2 + || c _ Cg ||2^ + ^ min || z ..|2 ) l|]_ 

where z = D h u G IR 2 ™^ -1 ), and / = D\D h g G IR" 2 . Of course c = c g is again 
assumed at the minimizer u, since this latter term in <Ji/^/j l does not depend on z. 
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However, in order to minimize only over vectors in R 2n ( n x ) that are derivatives of 
vectors in IR™ 2 , we must minimize <Ju^/ji(z, c) subject to the constraint DhD^z = z. 



u i,j + l (U x )i,j + i U i + i,j + 1 



(%) i,] 



(Uy)i + 1, j 



U i, 



(Ux)i,j u i + l,j 



Figure 4.1: Compatibility conditions of derivatives in 2D. 



The 2n(n — 1) linearly independent constraints D^D\z = z are equivalent to the 

discrete Schwartz constraints 2 : , 

(Uy)ij + (U x )ij+1 = (Uy)i+l,j + (U x )ij, (4.7) 



that establish the equivalence of the length of the paths from Uij to ■Uj+ij+i, whether 
one moves in vertical first and then in horizontal direction or in horizontal first and 
then in vertical direction (see Figure 4.1). 



In short, we arrive at the following constrained optimization problem: 

2 These discrete conditions correspond to the well-known Schwartz mixed derivative theorem for 
which d xy u — d yx u for any u € C 2 (0). 
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Minimize J 1/VE (z) = h" [\\Tz - f\\] f + E M min {|^| 2 , \) ] . 

(4.8) 

subject to Qz = 0, 

for T = D\ and Q = X — D h D\. Once the minimal derivative vector z is computed, 
we can assemble the minimal u by incorporating the mean value of g as follows: 

U = D[Z + Cg. 

Regularization of inverse problems by means of the Mumford-Shah con- 
straint 

The Mumford-Shah regularization term 

MS(u)= ( \Vu\ 2 + f3U d -\S u ), (4.9) 
Jn\s u 

has been used frequently in inverse problems for image processing [41, 62], such 
as inpainting and tomographic inversion. Despite the successful numerical results 
observed in the aforementioned papers for the minimization of functionals of the 
type 

J(u) = a\\Ku - g\\ L 2 {Q) + MS(u), (4.10) 

where K : L 2 (f2) — > L 2 (il) is a bounded operator which is not boundedly invertible, 
no rigorous results on existence of minimizers are currently available in the literature. 
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Indeed, the Ambrosio compactness theorem [2] used for the proof of the case K = I 
does not apply in general. A few attempts towards using the regularization MS 
for inverse problems in fracture detection appear in the work of Rondi [63, 64, 65], 
although restrictive technical assumptions on the admissible discontinuities of the 
solutions are required. 

As one of the contributions to this paper, we show that discretizations of regularized 
functionals of the type (4.10) always have minimizers (see Theorem 4.2.2). More 
precisely, these discretizations correspond to functionals of the form, 

J V W^ U) := ah2 \\ Ku -9\\i+h 2 

(hi,hj)£Q h 

(4.11) 

and we prove that such functionals admit minimizers. Note that the discrete Mumford- 
Shah approximation (4.5) can be written in this form. We go on to show that such 
minimizers can be characterized by certain fixed point conditions, see Theorem 4.6.1 
and Theorem 4.6.2. As a consequence of these achievements we can prove that global 
minimizers are always isolated, although not necessarily unique, whereas local min- 
imizers may constitute a continuum of unstable equilibria. Hence, our analysis will 
shed light on fundamental properties, virtues, and limitations, of regularization by 
means of the Mumford-Shah functional MS, and provide a rigorous justification of 
the numerical results appearing in the literature. 

It is useful to show how the discrete functional (4.11) can be still expressed in terms 



u i+l,j ~~ u 



1,3 



h 



+ w h 



h 
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of the sole derivatives for general K. As done before in the case K — I, and with 
the now usual identification u = (DhU, c), we can rewrite the functional in terms of 
derivatives and mean value as follows: 



J^j- h {z,c) = h 2 a\\KDiz-(g-Kc)\\l + h 2 J2^l\z t J\'^, (4.12) 



Note that in general we cannot anymore split orthogonally the discrepancy \\KD h z — 
(g — Kc) HI into a sum of two terms which depend only on derivatives z and mean 
value c respectively Nevertheless, for fixed z, it is straightforward to show that 
c = argmin ci 7 fjjr:{ z i c ) depends on z via an affine map. Indeed we can compute 



where 1 is the constant vector with entries identically 1. Here we assume that 
1 ker K, that is a necessary condition in order to be able to identify the mean value 
of minimizers (a similar condition is required anytime we deal with regularization 
functionals which depend on the sole derivatives, see, e.g., [23, 73]). By substituting 
this expression for c into (4.12), it is clear that the minimization of functionals (4.11) 
can be reformulated, in terms of the sole derivatives, as constrained minimization 
problems of the form (4.8). 
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4.2 Existence of minimizers for a class of discrete 
free-discontinuity problems 

In light of the observations above, we can transform the problem of the minimization 
of functionals of the type (4.10), by means of discretization first and then reduction 
to sole derivatives, into the (possibly, but not necessarily) constrained minimization 
problem: 



Minimize J r (u) = [\\Tu - g\\ 2 M + £ii min {\ Ui \ 2 , r 2 } ] . 

2 (4-13) 

subject to Qu = 0. 



Our first result ensures the existence of minimizers for the constrained optimiza- 
tion problem (4.13): 

Proposition 4.2.1. Assume r > 0, and fix linear operators T : M. N — > IR M and 

Q : M. N — > IR M ' ; which are identified in the following with their matrices with respect 
to the canonical bases. We also fix g G M M . The constrained minimization problem 



Minimize J r (u) = [\\Tu - g\\ 2 M + min {\ui\ 2 , r 2 } ] 

2 (4.14) 

subject to Qu = 0. 



has minimizers u* . 

Proof. We begin by noting that inf q u=0 J r {u) is well-defined and finite, since J r > 
is bounded from below. It remains to show that there exists a vector u* that 
satisfies J r (u*) = inf ueR jv J r {u). Towards this goal, consider the following partition 
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V = {Wj j }| =1 of W N indexed by the subsets Ij of the index set X = {1, 2, N}, as 
follows: 

U Xj := {u G M. N : \ui\ <r,ie X j: \m\ > r,i E T/Xj}. (4.15) 

The minimization of J r subject to Qu = and constrained to the closure of the 
subset Ux i can be reformulated as a quadratic optimization problem, for which the 
classical Frank- Wolfe theorem [7] guarantees the existence of a minimizer u(Xj). Now, 
since R N = Ujlj, the minimal value of J r subject to Qu = and over all of R N is 
just the minimal value from the finite set {J r {u{Xj)) : j — 1, . . . , 2^}; that is, 

mm J r (u) = mmJriuilj)) 

Qu=0 Xj CX 

and u* = arg min QM=0 J r (u) = u{ arg min XjCX J r (u(lj))) . □ 

In fact, Proposition 4.2.1 extends to a much larger class of free-discontinuity 
type minimization problems; by the same reasoning as before, we arrive at the more 
general result: 

Theorem 4.2.2. The constrained minimization problem 

Minimize J?(u) = [\\Tu - g\\ 2 eM + £ti min {\ Ui \P , r?} ] 



N 

lllllHI/M' . r i 

(4.16) 



subject to Qu = 0. 

has minimizers u* for any real-valued parameter p > 1. 

The Frank- Wolfe theorem, which guarantees the existence of minimizers for quadratic 
programs with bounded objective function, does not apply to the general case p > 1 
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where the objective function J? is not necessarily quadratic. Nevertheless, with the 
following generalization for the Frank- Wolfe theorem, Theorem 4.2.2 follows directly 
from a similar argument as for Proposition 4.2.1. 

Proposition 4.2.3. Suppose A is an NxN positive semidefinite matrix, and suppose 
b and c are Nxl vectors. Suppose also that X is a nonempty convex polyhedral subset 
ofM. N . The convex optimization problem 



admits minimizers for any real parameter p > 1, as long as the objective function is 
bounded from below. 

For ease of presentation, we reserve the proof of Proposition 4.2.3 to the Ap- 
pendix. 

From the proof of Theorem 4.2.2, one could in principle obtain a minimizer for J7j? by 
computing a minimizer u(Xj) for each subset Xj C X using a quadratic program solver 
[7], and then minimizing J* over the finite set of points {u(Xj)}. Unfortunately, this 
algorithm is computationally infeasible as the number of subsets of the index set 
{1, 2, N} grows exponentially with the dimension iV of the underlying space. 

Indeed, the minimization problem (4.16) is NP hard, as the known NP-complete 
problem SUBSET-SUM can be reduced to this problem. A complete discussion about 
the NP-hardness of (4.16) can be found in [1] . 




(4.17) 
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4.3 An iterative thresholding algorithm for 1-D 
free-discontinuity inverse problems 

4.3.1 Overview of the algorithm 

In this section, we introduce an algorithm that is guaranteed to converge to a local 
minimizer of the real- valued functional J? : £2(1) — > K having the form 

J?(u) = \\Tu-g\\l {K) + £min{k|V P }, (4.18) 

subject to the conditions: 

• X and JC are countable sets of indices, and T : £2(1) — > ^(£) is a bounded 
linear operator, which is in the following identified with its matrix associated 
to the canonical basis; 

• the operator T has spectral norm ||T|| < 1. Note that this requirement is easily 
met by an appropriate scaling for the functional, i.e., we may have to consider 
instead 

J?(u)=i\\Tu-g\\lw+^mm{\u i \>,r*}, 7 < 1. 

This modification leads to minor changes in the analysis that follows (see also 
Subsection 4.7.2), and throughout this paper we assume, without loss of gen- 
erality, that 7 = 1; 
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• the parameter p is in the range 1 < p < 2. In case the index set X is finite, 
only the restriction p > 1 is necessary. 

We note that the scaled ID discrete Mumford-Shah functional \J' 1 / y /% is clearly a 
functional of the form (4.18) having r = 1/Vh, index set X = {1, . . . , [r 2 \ }, parame- 
ter p = 2, and operator T = D^ r2 : IR^-M — > RL r2 J . As shown in the Appendix, the 
operators D^ r2 satisfy the uniform bound ||-D{/ r2 || < 1/2, independent of dimension, 
so a scaling factor is not needed in this case. 

In the following, we will not minimize J$ directly. Instead, we propose a majorization- 
minimization algorithm for finding solutions to J* , motivated by the recent appli- 
cation of such algorithms for minimizing energy functionals arising in sparse signal 
recovery and image denoising [10, 30]. More precisely, consider the following surro- 
gate objective function, 

J r ^(u, a) := J?(u) -\\Tu- Ta\\\ {K) + \\u - a\\ {xy u, a E £ 2 (X). (4.19) 

The surrogate functional J^ surr satisfies J^ surr (u,a) > J?(u) everywhere, with 
equality if and only if u = a, and is such that the sequence 

u n+l = arg min J r p ' surr (u, u n ) (4.20) 

u 

obtained by successive minimizations of jv ,surr {u,a) in u for fixed a results in a 
nonincreasing sequence of the original functional J^{u n ) (see Lemmas 4.3.1 and 
4.3.2). We will study the implementation and the convergence properties of the 
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iteration (4.20) as follows: 

• in Section 3.2, we review the standard properties of majorization-minimization 
iterations, 

• in Section 3.3, we explicitly compute w-global minimizers of the surrogate func- 
tional JP' surr (u,a), for a fixed; 

• in Section 4.1 we discuss connections between the resulting thresholding func- 
tions and thresholding functions used in compressed sensing, 

• in Sections 5.1, 5.2, and 5.3, we show that the sequence (w n ) ng N defined by 
(4.20) will converge to a stationary value u = argmin u J7~ r p ' s " rr (w, u), starting 
from any initial value u° for which J^{u°) < oo, 

• in Section 5.4, we show that such stationary values u are also local minimizers 
of the original functional J7"? that satisfy a certain fixed point condition, and 

• in Section 5.5, it is shown that any global minimizer of J$ is among the set of 
possible fixed points u of the iteration (4.20). 

By means of the thresholding algorithm, we also show that global minimizers of the 
functional are isolated, and moreover possess a certain segmentation property 
that is also shared by fixed points of the algorithm. 
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4.3.2 Preliminary lemmas 

The lemmas in this section are standard when using surrogate functionals (see [30] 
and [10]), and concern general real- valued surrogate functionals of the form 

?™tu, a) = T{u) -\\Tu- Ta\\l (lc) + \\u - a\\l (x) . (4.21) 

The lemmas in this section hold independent of the specific form of the functional 
T : £2(1) — > but do rely on the restriction that ||T|| < 1. 

Lemma 4.3.1. If the real-valued functionals J 7 (u) and J rsurr (u,a) satisfy the relation 
(4.21) and the sequence (u n ) ne ^ defined by u n+1 = argmin ue ^ 2 (x) J- surr (u, u n ) is ini- 
tialized in such a way that J-{u Q ) < 00, then the sequences J(tt n ) and J csurr (u n+1 , u n ) 
are non-increasing as long as \\T\\ < 1. 

Proof Since ||T|| < 1, also ||T*T|| < 1, and so the operator L = \fl — T*T is a 
well-defined positive operator whose spectrum is contained within a closed interval 
[c, 1] that is bounded away from zero c > 0. We can then rewrite J rsurr (u n+1 u n ) as 
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F surr (u n+1 ,u n ) = F{u n+1 ) + \\L{u n+l - u n )\\\ {xv from which it follows that 
Hu n+1 ) < Hu n+1 ) + \\L(u n+1 -u")\\l (I) 

= r urr (u n+ \u n ) 

< T surr (u n ,u n ) 

= Hu n ) 

< F{u n ) + \\L(u n - u n ~ l )\\l (X) 

= r^K,/" 1 ), (4.22) 

where the second inequality follows from u n+1 being a minimizer of J rsurr (u, u n ). □ 
From Lemma 4.3.1 we obtain the following corollary: 

Lemma 4.3.2. As long as the conditions of Lemma 4.3.1 are satisfied, one can 
choose N G N sufficiently large such that for all n > N, \\u n+1 — u n \\i 2 (x) < e, i-e., 

lim \\u 1l+1 -u n \\e 2( x) = 0. 



Proof. From Lemma 4.3.1, it follows that J r (u n ) > is a nonincreasing sequence, 
therefore it converges, and J(u n ) — J r (-u n+1 ) — > for n — > oo. The lemma follows 
from (4.22), and the estimates 

Hu n )-Hu n+1 ) > \\L(u n+1 -u n )\\l {x) > (1 - \\Tf)\\u n+1 -u n \\l { xy 
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□ 



4.3.3 The surrogate functional jv ,surr , its explicit minimiza- 
tion, and a new thresholding operator 

It is not immediately clear that the surrogate functional JP' surr in (4.19) is any easier 
to manage than its parent functional 3? ■ However, expanding the squared terms on 
the right hand side of (4.19), 3f ,surr (u : a) can be equivalently expressed as 

3 r p ' surr (u,a) = \\u-(I-T*T)a + T*g\\l {x) + Y,^HK\ p ,r p } + C 

iei 

= ^[(^-[a-T*ra + r*^) 2 + min{|^| p ,r p }] +C, 

■iex 

where the term C = C(T,a,g) depends only on T, a and g. Indeed, unlike the 
original functional J7jf, the surrogate functional 3^ surr decouples in the variables Ui, 
due to the cancellation of terms involving ||Tm||^ 2 . Because of this decoupling, global 
M-minimizers of 3f' surr (u, a), for a fixed, can be computed component- wise according 
to 



Ui = argmm 



(t- [a- T*Ta + T*g] i ) 2 + min{|t| p ,r p } , i E I. (4.23) 



One can solve (4.23) explicitly when e.g. p = 2, p = 3/2, and p — 1; in the general 
case p > 1, we have the following result: 

Proposition 4.3.3 (Minimizers of 3r ,surr (u,a) for a fixed). . 

1. Ifp > 1, the minimization problem u = argmin ue £ 2 (j) 3r' surr ( u , a ) can be solved 
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component-wise by 



Ui 



H^a-VTa + Vg^), iEl, 



(4.24) 



where #(p, r ) : K — > M. is the 'thresholding function', 



F-\\), \\\<X'(r,p) 
A, |A|>A'(r,p). 



(4.25) 



Here, F p X (A) is the inverse of the function F p (t) — t + | sgnt|t| p , and X' : 
A'(r,p) G (r, r + |r p_1 ) is £/ie unique positive value at which 



(4.26) 



2. W^/ien p = 1, the general form (4.24) still holds, but we have to consider two 



cases: 



(a) If r > 1/4, the thresholding function #(i, r ) : K. — >■ K. satisfies 



#(l,r)(A) 



0, |A| < 1/2 

(|A| - l/2)sgnA, 1/2 < |A| < r + 1/4 = A'(r, 1) (4.27) 
A, |A|>r + l/4 
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(b) If, on the other hand, r < 1/4, the function #(i, r ) satisfies 



0, \\\<y/f = \'(r,l) 
A, |A| > y/r 



(4.28) 



In all cases, the function i?( p , r ) is continuous except at \'(r,p), where i?( p , r ) has a 



holds that \'(r,p) > r while H^^X') < r. 

We leave the proof of Proposition 4.3.3 to the Appendix. 

Remark 4.3.4. In the particular case p = 2 corresponding to classical Mumford- 
Shah regularization (4.13), the thresholding function H( 2 , r ) '■ K — > R has a particu- 
larly simple explicit form: 



In addition to H^,r) and i?(i, r ), the thresholding operator H^/2,r)(X) corresponding 
to p = 3/2 can also be computed explicitly, by solving for the positive root of a 
suitable polynomial of third degree. In Figure 4.2 below, we plot H(2,i), #(3/2,1), and 
if(i 5 i) with parameter r = 1. For general noninteger values of p, i?( p , r ) cannot be 
solved in closed form. However, recall the following general properties of H( p , r )'- 

• #(p, r ) is an odd function, 



jump-discontinuity of size S(r,p) = |A' — H^ r ) ( A') | > «/ r > 0. In particular, it 




(4.29) 



• i?( Pif ,)(0) = 0, and 
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• H M (X) = A once |A| > r + \r*~ x . 

In fact, we can effectively ^recompute i?( p , r ) by numerically solving for the value of 
H(p >r )(\j) on a discrete set {Xj} of points Xj G (0, |r p_1 + r\. At Xj, one just needs 
to solve the real equation 

^ + - a ^ = ° ( 4 - 3 °) 

which can be computed effortlessly via a root-finding procedure such as Newton's 
method: while hj satisfies (hj — A.,) 2 + {hj) p < r p , set H^ r ^(Xj) = hj\ once this 
constraint is violated, set H^ r ^(Xj) = Xj. 
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(a) 




4.4 A connection to compressive sensing 

When p — 1 and r < 1/4, we know from Theorem 4.3.3 that the iterative algorithm 



u n+l = argminj^' 



(4.31) 



u 



reduces to the component-wise thresholding 



n+l 



^(K-w + r^), 



(4.32) 



where 



ff 7 (A) 



0, |A|<7 



(4.33) 



A, |A| > 7. 



This thresholding function : 1R — > 1R is referred to as hard-thresholding in the 
area of sparse recovery, and the iteration (4.32) generated by successive applications 
of hard thresholding has been previously studied [10]. In particular, the iteration 
(4.32) was shown in [10] to correspond to successive minimization in u for fixed a of 
the surrogate functional J^' surr (u, a) corresponding to the £q regularized functional, 



J?(u) = \\Tu - g\\\ {K) + r\\u\\ h)i x). 



(4.34) 



Here, the £0 quasi-norm ||u|| 4,(23 := J2iex \ u i\o i s defined component-wise by 





1, 



otherwise 
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It is not a coincidence that hard thresholding comes out from applying itera- 
tive thresholding to both the £ regularized problem (4.34), and free-discontinuity 
problem, 

J r \u) = \\Tu- g\\\ {K) + 5^min{|« i |,r} 
Indeed, consider the more general class of free-discontinuity-type functionals, 

J(r,a)W= \\Tu-g\\l {K) + aJ2^{\^lr}, (4.35) 

which have an additional degree of freedom in the scaling parameter a that is present 
in the discrete MS functional (4.5), but which was omitted in the previous section 
to ease computations. 

The free-discontinuity functional J^ ra ^ is related to the £ regularized functional 
F° as follows: 

Proposition 4.4.1. For any sequence (r n ) satisfying r n — > 0, the functional sequence 
^{-t/r n r n ) ^-converges to in the metric of ■ 

Proof. We have to verify the conditions for gamma convergence as established in 
Definition(4.1.1). Note that we can write 

= W Tu ~ 9\\ 2 e 2 (K) +7^min{|M j |/r n ,l}, 

so that T convergence of the full functional sequence Jhi^ r \ is established if we 
can show T convergence of the scalar function min{|a;|/r n , 1} to \x\ . 
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• Suppose that x n — > x. Our aim is to show that liminf rn ^ min{|a; n |/r n , 1} > 
\x\ : 

- If \x\ > e > 0, then \x n \ > e/2 for n sufficiently large, and 
lim rn ^ min{|a;' 1 |/r n , 1} = 1. 

— If on the other hand x = 0, then min{|x n |/r„, 1} > min{|x|/r n , 1} = 0. 

• On the other hand, it is easily seen that lim^^o min{|x|/r n , 1} = \x\ for any 
x G R, since 



lim min{|x|/r n , 1} 

Tn^O 



1, \x\ > 0, 
0, else. 



The desired result is established, according to (4.1.1). □ 

Remark 4.4.2. We showed that JL/ Tn Tn \ T converges to J 7 ® for sequences r n — > 0. 
More generally, J^ j ^ rn y rn ^ Y converges to for any p G [l,oo). It is not hard 
to show that the thresholding functions ^(p,7/(r-„)p,r 7l ) corresponding to J^j^y Tn y 
which can be derived following the proof of Proposition 4.3.3, converge to the hard 
thresholding function H ^ corresponding to J 7 ^. 

To ease presentation, we again take a = 1 in the sequel, although all of the following 
analysis extends to the case where a is a free parameter. Because a convergence 
analysis of the iteration (4.32) corresponding to hard thresholding has been studied 
already [10], we omit the case p = 1 and r < 1/4 in the sequel. 
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4.5 Convergence of the iterative thresholding al- 
gorithm (4.20) 

4.5.1 Fixation of the discontinuity set 

We prove now that the sequence (w n )„ e N defined by 

u n+1 = &igmmj r p ' surr {u,u n ) (4.36) 

u 

or equivalently, according to Proposition 4.3.3, component-wise by 

< +1 = H M ([u n -T*Tu n + T*g] t ), i e X, (4.37) 

will converge, granted that p > 1 and ||T|| < 1. To ease notation, we define the 
operator EI : ^(X) — > ^(X) by its component-wise action, 

\M(u)]i ■= H {p , r) {[u - T*Tu + T*g]i); (4.38) 

so that the iteration (4.37) can be written more concisely in operator notation as 

u n+1 = M(u n ). (4.39) 

We omit the dependence of EI on the parameters p, r, and the function g for continuity 
of presentation. At the core of the convergence proof is the fact that the 'discontinuity 
set', indicated below by X™, of u n must eventually fix during the iteration (4.37), at 
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which point the 'free-discontinuity' problem is transformed into a simpler 'fixed- 
discontinuity' problem. 

Lemma 4.5.1 (Fixation of the index set Xi). Fix p > l,r e R + , and <? e ii{K,). 
Consider the iteration 

u n+1 = U(u n ) (4.40) 
and the time- dependent partition of the index set X into 'small' set 

XI = {ieX:|<|<A'(r,p)} (4.41) 

and 'large' set 

XI = {ieX:|<|>A'(r,p)} (4.42) 

where X f (r,p) is the position of the jump discontinuity of the thresholding function, as 
defined in Proposition 4.3.3. For N e N sufficiently large, this partition fixes during 
the iteration u n+l = W(u n ); that is, there exists a set X such that for all n > N, 
Xq = X and X™ = X\ := X \ X . 

Proof. By discontinuity of the thresholding operator H^ r )(X), each sequence com- 
ponent 

< = %, r) (K _1 - T*Tu a ~ l + T*5f]j) (4.43) 

satisfies 

(a) < \'{r-,p) — S(r,p) < X'(r,p), if i & Xq, or 
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(b) K| > X'(r,p), if i e X™. 

Thus, |< +1 - <| > 5(r,p) if i e X ™ +1 f|^, or vice versa if i E T l f)l? +1 . At the 
same time, Lemma 4.3.2 implies 



a- 



n+1 -u?\<\\u n+1 -u n \\ h{I) <e, (4.44) 



once n > N(e), and e > can be taken arbitrarily small. In particular, (4.44) implies 
that X and X 1 must be fixed once n > N(e) and e < S(r,p). □ 

After fixation of the index set X = {i G X : < X'(r,p)}, M.(u n ) = Ui (-u n ) and 
Ux is an operator having component- wise action, for p > 1, 



F^([(/-T*T> + T*^), if.GX 
[U^uji = <( (4.45) 

((/-T*7> + T*(?) ifiGXx 



Here, as in Proposition 4.3.3, the function F' 1 is the inverse of the function 
F p (t) = t + |sgn£|t| p_1 . Again, for ease of presentation, we omit the dependence 
of Uj on the parameters p, r, and g. For p — 1 the description is similar, and in 
general, one easily verifies the equivalence 



Ux (f) = arg min jSf urr (u,v) (4.46) 



where J^ surr \ s a surrogate for the convex functional, 



Ji(u):=\\Tu-g\\l {!C) + J2\ui\ P - (4-47) 
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That is, fixation of the index set X implies that the sequence (u n ) ne ^ has become 
constrained to a subset of on which the map EI agrees with a map Ux , asso- 
ciated to the convex functional Jl Q . As we will see, this implies that the nonconvex 
functional J$ behaves locally like a convex functional in neighborhoods of fixed points 
u = W(u), including the global minimizers of J$ . 

4.5.2 On the nonexpansiveness and convergence for T injec- 
tive 

Given that W(u n ) = Ux (u n ) after a finite number of iterations, we can use well- 
known tools from convex analysis to prove that the sequence (w n ) n6 N converges. If 
the operator T*T : £2(1) —> is invertible, or, equivalently, if the operator T 

maps onto its range and has a trivial null space - as, for example, does the discrete 
pseudoinverse D\ in the ID Mumford-Shah approximation - then the mapping Ux 
has the nice property of being a contraction mapping, so that a direct application 
of the Banach fixed point theorem ensures exponential convergence of the sequence 
(u n ) n( zfq after fixation of the index sets. 

Theorem 4.5.2. Suppose T : li^C) — > ^{K,) maps onto ^{K,) and has a trivial null 
space. Let 5 > be a lower bound on the spectrum of T*T . Then the sequence 

u n+1 = U{u n ), (4.48) 

as defined in (4.38), is guaranteed to converge in norm. In particular, after a finite 
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number of iterations iVGN, this mapping takes the form 



u 



N+m 



= UZ(u N ), meN\{0}, 



(4.49) 



and the sequence (u n ) nGN converges to the unique fixed point u of the map Ux ■ More- 
over, after fixation of of the index setX , the rate of convergence becomes exponential: 



u N+m -u\\ HX) = \K(u N )-Uxl(u)\k^ < (l-*) m |K -*(x), mGN\{0}. 



The proof of Theorem 4.5.2 is deferred to the Appendix. 
4.5.3 Convergence for general operators T 

Unfortunately, if T*T is not invertible (that is, if 5 = belongs to its nonneg- 
ative spectrum), then the map Ux is not necessarily a contraction, and we can 
no longer apply the Banach fixed point theorem to prove convergence of the se- 
quence (w n ) ng N- However, as long as ||T|| < 1, we observe by following the proof of 
Theorem (4.5.2) that Ux is still non- expansive, meaning that for all v,v' G hi^), 
||Ui () (f) — Uz (i/) \\e 2 (x) < \\v — f'||^ 2 (x)- The following Opial's theorem [60], here re- 
ported adjusted to our notations and context, gives sufficient conditions under which 
non-expansive maps admit convergent successive iterations: 

Theorem 4.5.3 (Opial's Theorem). Let the mapping A from ^{T) to ^{T) satisfy 
the following conditions: 




(4.50) 
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1. A is asymptotically regular: for all v G l-£pj, ||A n+1 (t> ) — A n {v)\\i 2 (z) — > for 
n — > 00; 

2. A is non- expansive: for all v,v' G l-^C), ||A(t>) ~ &( v ')\\h{T) < \\ v — v '\\e 2 (i)> 

3. the set Fix(A) of the fixed points of A in £2^) is not empty. 

Then, for all v G £2^), the sequence (A n (t>)) nGN converges weakly to a fixed point in 
Fix(A). 

In fact, we already know that Ux is asymptotically regular, in addition to being 
nonexpansive - this follows by application of Lemma 4.3.1 and Lemma 4.3.2 to the 
functional J^. Thus, in order to apply Opial's theorem, it remains only to show 
that Ui has a fixed point; that is, that there exists a point u G £2(1) for which 



In more detail, we must prove the existence of a vector u G £2(1) satisfying 



The following lemma gives a simple yet useful characterization of points satisfying 
the fixed point relation (4.51): 

Lemma 4.5.4. Suppose p > 1. A vector u G £2^) satisfies the fixed point relation 



u = Ui (u). 




Ff\[(I-T*T)u + T*g] i ), if i G Z 
((I-T*T)u + T*g)., Hi el, 



(4.51) 
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u = Ux (u) if and only if 

[r*((/-Tti)]. = ^ (4.52) 
( Fp(tij) - iG X , 

Alternatively, if p = 1 and r > 1/4, u = Ui () (-u) zs satisfied if and only if 

[T*(g-Tu)].e [-1/2,1/2], 
< [T*(^-T«)]. = l/2sgn^, i ell (4.53) 
^ [T*(^-T«)]. = 0, iGXx, 

where in (4.53), the index set X is split into 

• Xq = {i e X : |-Uj| < 1/2}, and 

• X 6 = {iel : 1/2 < \ui\ < r + 1/4}. 

Again, recall the notation F p (t) = t + |sgnt|t| p_1 , and observe that the fixed 
point relation (4.52) has a very simple expression when p = 2. The proof of Lemma 
4.5.4 is given in the Appendix. 

The fixed point characterization of Lemma 4.5.4 will be crucial in the following 
theorem that ensures the existence of a fixed point u = Uz (u). We remind the 
reader that until now, all of the results of Section 1.3 remain valid in the infinite- 
dimensional setting |X| = oo. From this point on, however, certain results will only 
hold in finite dimensions; for clarity, we will account each such situation explicitly. 
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Proposition 4.5.5. In finite dimensions |Z| < oo, then there exist (global) mini- 
mizers of the convex functional, 

Jl(u) = \\Tu-g\\l {)C) + J2\ui\ p , (4.54) 

ieio 

for allp > 1, and any minimizer u of J^ o satisfies the fixed point relation u = Ux (u). 
Restricted to the range 1 < p < 2, the statement is true also in the limit |Z| = oo. 

Proof. In the finite-dimensional setting, minimizers necessarily exist for all p > 1 
according to Proposition 4.2.3. We now consider the general case. Consider the 
unique decomposition u = u + U\ into a vector u supported on Z and another 
Mi supported on Zi, i.e., the vectors u G ^f°(Z) := {u G ^(Z) : Ui — 0, i G Zi} 
and Mi G if (I) := {u G i 2 (T) : = 0, tG Z }. Let P : u ->■ uj and P x = 
Z — "P : u — > u denote the orthogonal projections onto the subspaces if (Z) and 
if (I), respectively. Consider the operators T = TV 1 and Ti = TP; note that 
clearly T = T + Ti is satisfied. The functional (4.54) can be re- written with this 
decomposition according to 

Jl(uo + «i) = \\T u + T lUl - g\\l {lc) + lko||^ 0(I) (4.55) 
where ||z||^i ^ := (X^ez \zi\ p ) 1 ^ P is the £ p -norm on vectors supported on Z . 

Let V\ be the orthogonal projection onto the range of Ti in £ 2 (^C) (not to be confused 
with V, which operates on the space ^(Z)) and let Vi = Z — V\ be the orthogonal 
projection in l^iJC) onto the orthogonal complement of the range of Ti. Then, fixing 
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u G ^(X), the vector Vi(g — T u ) e range(Ti) C ^(^C) is the solution to the 
minimization problem 



Pi(# - T u ) = arg min \\v - (g - T u )\\ 2 HK) , (4.56) 



so that minimizers of the functional T : £f°(X) ->■ R+ defined by 



7» = WToV + V^g-T^-gWl^ + M^^ 

= \\Kv-y\\l m + \\vV 4i){x) (4.57) 



with K := V^Tq, and y := "P^p, will yield minimizers of J^ Q . Functionals of the 
form (4.57) were studied in [30]; there, it is shown that as long as 1 < p < 2, T has 
minimizers, and any minimizer v can be characterized by the fixed point relation 

= F p \[(I - K*K)v + K*y} t ), i e X ; (4.58) 

(recall that F' 1 is the inverse of the function F p (t) = t + \ sgnt|t| p_1 ). 
In the finite-dimensional setting |X| < oo, the Euler-Lagrange equations correspond- 
ing to minimizers of the convex functional J 7 as in (4.57) imply the same fixed point 
relation (4.58) also, for all p > 1. 

By Lemma 4.5.4, the characterization (4.58) is equivalent to the condition 



• p > 1: 



[K*(y - Kv^.^^sgnvilv^' 1 , (4.59) 
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[K*(y-Kv)].e [-1/2,1/2], if \v t \ < 1/2, 
[K*(y-Kv)]. = l/2sgavj, if 1/2 < |^| < r + 1/4. 



i e Iq. (4.60) 



Making the identification u = v and Ti«i —Vi(g — T v), and rewriting K = V^Tq, 
and |/ = Vig, the relations (4.59) and (4.60) imply the full fixed point characteriza- 



Remark 4.5.6. The restriction p < 2 that is necessary for the results of this paper 
in the infinite dimensional setting \I\ = oo was only used in the proof of Theorem 
4.5.5, where it comes from [30] and is needed there to prove the existence of minimiz- 
ers of functionals J 7 of the form (4.57). If that proof can be extended to functionals 
of the form (4.57) for general p > 1, then the restriction p < 2 can be dropped in the 
current paper. For instance, if we additionally require that T is a bounded operator 
from £ P (T) to hi?) for 1 < p < oo then the existence of minimizers would be guar- 
anteed also for 1 < p < oo and \I\ = oo. In this case we could consider a minimizing 
sequence (v k ) of J 7 , which is necessarily bounded in t v . Therefore, there exists a sub- 
sequence (v kh ) which weakly converges in £ p to a point v*. This also implies the weak 
convergence of the sequence Kv kh in £ 2 ; note that {Kv kh ,w)e 2X e 2 = {v kh ,K*w)e p xe pl , 
for l/p + 1/p' — 1. By Fatou's lemma we obtain T{v*) < \im.m.{ h F(v kh ) and v* is 
a minimizer of J 7 . However, we still require that p > 1 for the proof of Proposition 
4.3.3 and for the results of the next section to hold. 

Combining the results from this section, we obtain: 



tion in Lemma 4.5.4. 



□ 
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Theorem 4.5.7. Suppose 1 < p < 2. Starting from any u° satisfying J^(u°) < oo, 
the sequence (w n ) ne N defined by u n+l = M n (u°) as in (4.38) will converge weakly to 
a vector u e hil) that satisfies the fixed point condition, 

1. \ui\ > X'(r,p), if i Eli = {j El : \uj\ > r} 

2- \ui\ < F-\\'(r,p)), forp> I, if i el = {j el: \uj\ < r}, and 
3. (a) Ifp>\: 



, 0, if\ui\ > X'(r,p) 

[T{g-Tu)]={ 71 (4.61) 

F p (ui) - u h if \ui\ < X'(r,p) - S(r,p) 



(b) Ifp = 1 andr> 1/4; 



[T*(g-Tu)].e [-1/2,1/2], |^| < 1/2 

[T*(g — Tu)]. = 1/2 sgn-Uj, 1/2 < |^| < r - 1/4. (4.62) 
[T*(g-Tu)]. = 0, \u t \>r + l/4. 

If the index set \1\ < oo is finite dimensional, the theorem holds for all p > 1. 

Proof. By Lemma 4.5.1, the map u n+1 = M.(u n ) becomes equivalent to a map of the 
form u n+1 = Vx (u n ) after a finite number of iterations N eN. By Lemma 4.5.1 and 
Proposition 4.3.3, the subset 1 C 1 separates 1 in the sense that, for all n> N, 

• \u?\<F-\X'(r,p)),Xiel , 

• \uf\ > X'(r,p), if % el x =X\X - 
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That the sequence (w ra )„ e N converges to a fixed point of the map Ux follows from 
Opial's theorem applied to the map Ux : 

1. the asymptotic regularity of Ux is a consequence of Lemmas 4.3.1 and 4.3.2; 

2. the nonexpansiveness of Ux follows from the proof of Theorem (4.5.2), and 

3. Theorem 4.5.5 guarantees that the set of fixed points of Ux in £2(1) is nonempty. 

The limit u of the sequence (u n ) will satisfy the fixed point conditions of Lemma 
4.5.4. Since weak convergence implies component-wise convergence, it follows for all 
i G Tq that 

\ui\ = lim \u™\ 

n— >oo 

< X'(r,p)-S(r,p) (4.63) 
and the respective lower bound \uf\ > A'(r, p) holds analogously for iE.Ii. □ 

4.6 On minimizers of 

We are now in a position to explore the relationship between limit vectors u of 
the iterative thresholding algorithm (4.38) and minimizers of the free-discontinuity 
functional J" r p (4.18). As a first but important result in this direction, 

Theorem 4.6.1. A point u satisfying the fixed point relation of Theorem 4.5.7 is a 
local minimizer of the functional J? defined in (4.18). 
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The proof of Theorem 4.6.1 is omitted at present but can be found in the Ap- 
pendix. This result should not be surprising, however. Due to the separation of the 
entries of any fixed point u, such that Ui < r < Uj for i G lo and j G Xi, we have 
also X = {i G X : \ui\ < r} and X x = {j G X : > r} for all u G -B(w, £(?")), where 
B(u,e(r)) is a ball around an equilibrium point u of radius e(r) > sufficiently 
small. On this neighborhood B{u,e{r)) of u, the functional ^ is convex. Since u is 
obtained as the limit of a sequence {u n ) in _B(/u, e(r)) for which the sequence Jp{u n ) 
is nonincreasing, one would expect that u minimizes J^{u n ) within this neighbor- 
hood. 

More surprising is that global minimizers of 3$ are also fixed points, as shown in the 
following theorem. Even though the existence of such minimizers is only guaranteed 
in the finite-dimensional setting (see Proposition 4.2.3), the following result is not 
restricted as such. 

Theorem 4.6.2 (Global minimizers of J* are fixed points u = M(u)). Any global 
minimizer u* of J7j? satisfies the fixed point condition of the map HI that is given in 
Theorem 4-5.7. 

The proof of Theorem 4.6.2 is rather long and we defer it to the Appendix. 
We reiterate once more that on a ball B(u,e(r)) around an equilibrium point u of 
radius e(r) > sufficiently small, the functional J? is convex; following the proof of 
Theorem 4.6.2, we see that J* is in fact strictly convex whenever u = u* is a global 
minimizer, since the restriction of T to the subspace ^(X) C ^(X) of vectors with 
support in X\ must be an injective operator in this case. Hence a global minimizer 
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is necessarily an isolated minimizer, whereas we cannot ensure the same property 
for local minimizers if T has a nontrivial null-space; in this case, local minimizers 
may form continuous sets, as it is shown in the bottom-right box of Figure 4.3. We 
conclude the following remark. 

Corollary 4.6.3. Minimizers of J* are isolated. 

4.7 Numerical Experiments 

4.7.1 Dynamical systems, stability, and equilibria 

Iterative thresholding algorithms have a natural interpretation as discrete-time dy- 
namical systems with nonsmooth right-hand-side, and can be associated to continu- 
ous dynamical systems of the type: 



The study of the existence, uniqueness, stability, and long-time behavior of these 
ODE's is of fundamental interest in order to clarify also the stability properties of 
iterative thresholding algorithms. Indeed, other than soft-thresholding iterations 
[30], the corresponding right-hand-side is not Lipschitz continuous and can even be 
discontinuous, as is the case for free-discontinuity problems. In [14, 42] conditions 
are established for the existence, uniqueness, and continuous dependence on the 
initial data (at finite time) of solutions of dynamical systems with discontinuous 



ii(t) 



F(u(t),t) 



T(H M (u(t)+T*(g-Tu(t)))-u 



(*)) , t>t , r > 0. 
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Figure 4.3: We show patterns in IR 2 formed by initial points u° colored according to 
the corresponding equilibria computed as limits of the iterative thresholding algo- 
rithm (4.37). For invertible 2x2 squared matrices T, the equilibria are isolated and 
the region of initial points for which (4.37) converges to a given equilibrium point do 
partition the space into sets which might be disconnected. Structures of the partition 
generated by a particular matrix T is exemplified on the left, and in the right box we 
show a pattern related to iterations where the 2x2 squared matrix T has nontrivial 
null-space. We can see again that global minimizers are isolated and correspond to 
the points on the axes, whereas local minimizers are continuously distributed along 
an affine space generated by the kernel of T. It is not difficult to show that this 
structure always occurs for such matrices. 

right-hand-side. However, very little is known about long-time properties of such 
dynamical systems and about the nature of their equilibrium points. 



For several continuous thresholding functions, such as the ones introduced in [30, 45, 
44], one can easily show, for instance by means of T-convergence arguments, that 
equilibrium points depend continuously on the parameters of the thresholding, see, 
e.g., [44, Theorem 5.1]. Nevertheless, for discontinuous thresholding functions i?( p , r ) 
such as those studied in this paper, sudden bifurcation phenomena and instabilities 
do appear in general. Figure 4.3 shows that multiple equilibrium points can ex- 
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ist for these thresholding operators and their number may depend discontinuously 
on the thresholding shape parameters. Moreover, as established in Theorem 4.6.2, 
global minimizers of J7j? are always stable equilibria and isolated points, while local 
minimizers can be unstable equilibria and form a continuous set, as shown in the 
bottom-right box of Figure 4.3. 

4.7.2 Denoising and segmentation of 1-D signals and digital 



In this subsection, we are concerned with numerical experiments in the use of an 
iterative thresholding algorithm for the minimization of 

N 

JrM := \\Dlu-g\\l +1 J2 m]n { u l r2 }> ( 464 ) 

i=i 

modelling problems of denoising and segmentation. 

Note that we introduced an additional regularization parameter 7 > which has 
the sole effect of modifying the thresholding function H( 2 ,r,j) as follows 



This thresholding function can be again easily computed by means of an argument 
similar to the proof of Proposition 4.3.3. In Figure ?? we show the results of ap- 
plications of the iterative thresholding algorithm (4.37). In Figure 4.5 we show a 
comparison of the use of the thresholding H( 2 ,r,-y) and the soft-thresholding 5 7 (see 
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Figure 4.4: We show the application of the iterative thresholding algorithm (4.37) 
for the classical denoising problem of 1-D signals where K = I in (4.10), and hence 
T = D* h . The thresholding parameters used for the numerics are r = 2.2 and 
7 = 0.002. 
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Computed denoised 1 D signal by iterative thresholding Computed denoised 1 D signal by iterative soft-thresholding 
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Figure 4.5: A comparison of the denoising of the signal in Figure ?? by means of the 
algorithm (4.37) and by iterative soft-thresholding [30] applied to discrete derivatives. 
We can appreciate how the algorithm (4.37) promotes piecewise smooth solutions, 
whereas the iterative soft-thresholding promotes the total variation minimization 
with the introduction of a 'staircase effect'. The thresholding parameters used for 
the numerics are r = 2.2 and 7 = 0.002 for (4.65), and 7 = 0.002 for the soft- 
thresholding (4.104). 
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its definition in (4.104)); the former promotes the minimization of the Mumford- 
Shah constraint MS and piecewise smooth solutions, whereas the latter promotes 
the minimization of a total variation constraint [67], which is also well-known to 
produce (almost) piecewise constant solutions with a perhaps unwanted 'staircase 
effect'; see also [23, Section 4] for details. 

4.7.3 Inverse problems 

As already mentioned in Subsection 4. 1.4 the Mumford-Shah term MS(u) = J n ^ s | Vn| 2 + 
f3'H d ^ 1 ( y S u ) is also used for regularizing inverse problems involving operators T which 
are not boundedly invertible. In this section we present a numerical experiment 
on the use of the algorithm (4.37) for ID interpolation (Figure 4.6). In this case 
the operator T is a multiplier by a characteristic function of a subdomain, i.e., 
Tu := xd ■ u, for D C f2; see [41] for other numerical examples previously obtained 
with the Mumford-Shah regularization. 

In Figure 4.6 we show the reconstruction of the noiseless signal of Figure ?? 
provided information only out of the interval [100, 150] which has to be restored. On 
the left boxes we show the results due to algorithm (4.37) and on the left ones the 
solution computed by iterative soft-thresholding. In the former the solution is again 
piecewise smooth and in the latter a (almost) piecewise constant solution is instead 
produced. 
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Interpolation of the signal computed with the iterative 
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Figure 4.6: Interpolation of an incomplete signal by means of the Mumford-Shah 
regularization and the total variation minimization provided by respective iterative 
thresholding algorithms. The red interval is the region where no information on the 
original signal is provided. The thresholding parameters used for the numerics are 
r = 2.2 and 7 = 0.002 for (4.65), and 7 = 0.002 for the soft-thresholding (4.104). 
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4.8 Appendix 



4.8.1 Proof of Proposition 4.2.3 

First, we recall Weierstrass' Theorem, which is used in the proof of Proposition 4.2.3 
below. 

Theorem 4.8.1 (Weierstrass' Theorem). The set of minima of a convex function f 
over a subset X C M. N is nonempty and compact if X is closed, f is lower semicon- 
tinuous over X , and the function f , given by 

(fix) , ifxeX, 
(4.66) 
oo otherwise, 

is coercive, i.e., for every sequence (xk) C X s.t. \\xk\\ — > oo, we have lim^oo f(xk) = 
oo. 

The following two lemmas will be helpful in the proof of Proposition 4.2.3. 

Lemma 4.8.2. Let F(u) be a convex function defined on having the general form 

, for some p > 1. Fix x and d in M. N . If F is 
bounded above and below on the ray {x + td,t > 0}, then F is constant on the line 
x + td. 

Proof Let /i(t) = F(x+td), and note that ji is convex because F is convex. Moreover, 
jj, has the general form ji{t) = P(t)+'^2 1< j <N Cj\\xj+tdj\\ p where P(t) is a polynomial 
in t of order at most 2. Without loss of generality, suppose < fi(t) < 1 for all values 



F(u) 



u 



Au + b f u + Y, 



l<j<N \ U 3 
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of t G M + . Then there exists a sequence of points (t n ) neN , t n — > oo for n — > oo, for 
which fi(t n ) is a convergent sequence; let us denote the limit of this sequence by 7. 

I. Case 1: 1 < p < 2. To repeat, 



Since = lim n _ ) . 0O //(i n )/^, it follows that all coefficients in /j,(t) of degree 2 
must vanish. In turn, then, = lim^^ fi(t n )/t^, has the implication that 
for each j, one of the coefficients Cj or dj must vanish as well. Following in 
the same manner, we conclude that all linear coefficients in //(£) also vanish, 
leaving only the possibility that p,(t) = 7 is a constant function. 

2. Case 2: p > 2: The proof in this case is identical to that of the previous case, 
and as such we leave the details to the reader. 



Lemma 4.8.3. Suppose F is a convex function defined on R that is bounded from 
below, and has the property that if F is bounded above on a ray {x + td,t G 
then F is constant on the line x + td. Then if F is constant on the line x + td, F is 
also constant on any parallel line y + td. 

Proof. Let fi(t) = F(x + td) which by assumption is a constant function /i(t) = 7, 
and let v(t) = F(y + td). Fix t G 1R + , and let z be the point z = x + 2{y — x), i.e. 




(4.67) 



□ 
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y = \x + \z. By convexity of F, we have that 




(4.68) 



for a constant a. It follows that F is bounded above by a on the ray {y + td, t G 



We now prove Proposition 4.2.3. Choosing xq G X, we define the (nonempty) set 



Obviously, the set M is convex and closed. By assumption, F is bounded from below 
on X and hence on M. Therefore, if M is bounded, then Weierstrass' Theorem yields 
the desired result. 

Thus, we may assume that M is unbounded. Then, the convexity of M implies 
that M contains a ray r = {z + td, t > 0}. Denote by r l5 r 2 , ...,rj a set of J rays 
in M corresponding to linearly independent vectors di, dj, so that any ray in M 
can be expressed as a linear combination of the ri,...,rj. By definition of M and 
by the assumption, F is bounded on M, hence, F is constant on each of the the 
lines Zj + tdj, according to Lemma (4.8.2). From Lemma (4.8.3), it follows that F is 
constant along each line x + tdj for arbitrary x G M. N , from which we deduce that F 
is constant along any line x + td for arbitrary d G Y = spanjrfi, dj}. Thus, we 
project X onto the subspace of WL N that is orthogonal to Y; call this subspace X. 



from which it follows, by assumption, that F is constant on the line y + td. 



□ 



M := in {x G R N ,F 



(x) < F(x )}. 



(4.69) 
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From the foregoing arguments, we have 



inf F(u) = miF(u) 
x x 



(4.70) 



As X is still a convex polyhedral set, and by construction M = X n {x e IR^} 
contains no rays, Weierstrass' Theorem yields the desired result. 

4.8.2 On uniform boundedness of \\D* h \\ 

The aim of the second part of the appendix is to prove the uniform bound \\D\W < 1/2 
eluded to in Section 3.1. Again, \\A\\ denotes the spectral norm of the matrix A, and 
D{ : W 1 - 1 ->■ W 1 is the pseudo -inverse of the discrete derivative matrix D h as given 
by (4.6), with the identification n = \ l/h\. From the expression for D h) and the 
knowledge that DhD h = I is the identity operator and D^Dh = (D h Dh)* is self- 
adjoint, the nx(n-l) matrix D^ h is identified as follows: 



/-(n-l) -(n-2) -(n-3) -1 \ 



1 



(n-2) -(n-3) 



-1 




1 



(4.71) 



ri 



2 



1 



2 



(n-3) 



-1 



V 



1 



2 



3 
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It is well-known that the spectral norm of an m x n matrix can be bounded by the 
more manageable entry-wise Frobenius norm, according to 



\\M < \\A\U 



\ 



EEi^-i 2 - ( 4 - 72 ) 



As such, we need only to bound the sum of the squares of the entries of D h . The 
sum S l n = Y^JjZi \di,j\ 2 over entries in the first row of D\ is given by S\ — (n — 
l)(2n - l)/(6n 3 ), using the familiar formula E^Li J 2 = l N ( N + + !)■ The 

analogous sum over entries in the j th row of D\ is seen inductively to satisfy S 3 n = 

Si - + J -^r^-- The total sum S n = Y!j=i s i is tnen S n = | - ^ > and we arrive 
at the desired uniform bound: 



\Dl\\< v / ^<^<!/2. (4.73) 



4.8.3 Proof of Proposition 4.3.3 

In order to help the reading of the current proof, as well as the proofs of Theorem 
4.5.7 and Theorem 4.6.2 in later appendices, we report in Table 1 the notation of the 
functions used in the proof of Proposition 4.3.3 for the definition of H^ r y 
Consider the functions 

L p (t,\) = (£-A) 2 + min{|t|V P }, (4.74) 
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L p (t,X) 


= (t- A) 2 + min{|t|V p } 


G p (t,X) 


= {t-xy 2 + t\p 


F P (t) 


= t + lsgnt\t\ p -\ p> 1 


S P (X) 


= G P (F-\X), X) = (F~\X) - A) 2 + F~\X)\P, p > 1 




= argmin t>0 L p (t, X) for general A > 0, p > 1 




= argmin <t< r G p (t, A) = F~ 1 (X) for < A < r 




= f F p \X), if G p (Fp 1 (X), A) < r p for ^ > r 
\ A, else 



Table 4.1: Notation of the functions involved in the definition of -ff( p , r ) as in the proof 
of Proposition 4.3.3. 



and 

G p (t,X) = (t- A) 2 + \t\ p . (4.75) 
The proof reduces to solving for 

H (p,r) (A) = arg min L p (t, A) (4.76) 

as a function of A 6 M. Since L p (t, A) = L p (—t, — A), the function if( Pir )(A) will be 
odd, and since also iJ( Pjr )(0) = 0, we can, without loss of generality, restrict the 
domain of interest to A > 0. On this domain, if( Pif ,)(A) = argmin ie iK L p (t, A) is non- 
negative, since L p (t,X) < L p (—t,X) when t > and A > 0. Hence, we can restrict 
the minimization of L p (t, A) to t > 0. 

It will be convenient to split the proof into two cases: 1 < p and p — 1. 
1. We first analyze the case 1 < p. 
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Note that 



argminL p (t, A) = argmin(t — A) 2 

t>r t>r 

= max{A,r}, (4.77) 

so that the minimization (4.76) naturally splits into the following two cases: 

(a) If A < r, the minimizer has to be searched in [0,r], hence 

H M (X) = arg nrin G p (t, A) = F^A) < A (4.78) 

where F~ 1 (X) is the functional inverse of the increasing, and continuous 
function 

F p {t)=t + P -sgnt\t\ p -\ (4.79) 

(b) On the other hand, if A > r, the minimizer has to be searched in [0,A], 
hence 



#(p,r)(A) 



F-\\), if Gp(F p ' 1 (\), A) < r p 
A, else 



By implicit differentiation of the functional relation F p (F p = A, it is clear 
that the functions F p 1 (X) and S p (\) := G p (F p 1 (\), A) are strictly increasing 
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functions in A. Indeed, we have the bounds 

< ^(A) = (^(F-^A)))- 1 = (l + P -^(F-\X)r^ " < 1, 
and 

^^(F-^A), X)^F-\X) + ^G p (F p ~\X), A) 
MF p -\X) - A) + P (F p - 1 (X)Y- 1 )±F-\X) - 2(F p \X) - A) 
2 (l - ^p _1 (A)) (A - F-\X)) +P^F-\X)(F-\X)r 1 > 0, 

since < ^-F p _1 (A) < 1, and 

< F p -\X) < A. (4.80) 

Also observe that F~\r + fr^ 1 ) = r, and S p {r + fr^ 1 ) = r p + ^r 2p - 2 > r p . 
This leads us to immediately conclude that 

(i) If A < r, then H M (X) = F'^X) (from (4.78)). 

(ii) If A > r + lr p ~\ then S P (X) = G p (F p \X), A) > r p , so that H M (X) = A. 

(iii) Since S p (r) < r p while S p {r-\-^r p ~ 1 )) > r p , the intermediate value theorem 
implies that there exists a unique value X'(r,p) lying strictly within the 
interval (r, r p_1 (| + r 2 ~ p )) at which 

S r (X't = r", (4.81) 
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and 




(A) A<A'(r,p) 
A> X'(r,p) 



(4.82) 



At A', iJ( Pir )(A') = argmin t > L p (t, A') is not uniquely denned and is real- 



for the sequel; as will be made clear, this will not cause problems in the 
ensuing analysis. Finally, note that 

(iv) At A', the function H( p ^ has a discontinuity S(r, p) = X — /J( pr )(A / ) that 
is strictly positive, as long as r > 0. Indeed, on the one hand, we know 
that A'(r, p) > r, on the other hand, Hi yP , r ){\') < r. This follows because 



2. The analysis of the case p — 1 is left to the reader since it follows a similar 
argument as for p > 1. 

4.8.4 Proof of Theorem 4.5.2 

We assume that the operator T*T : £2(1) _ > ^(Z) is nonnegative, so that its spec- 
trum lies within an interval [5, 1] with 5 > 0, and the operator / — T*T has norm 
|| j _ rp*rp^ < i _ $ j n particular, if T*T is invertible, then the inequality 5 > is 
strict, and so \\I - T*T\\ <1-S<1. 



ized at F p and at A'. In this case, we identify H( p ^(\') = F p 1 (X) 



H(p,r)(X) 



F-\X'), and 



(^ 1 (A / )) P <(^ 1 (A / )-A / ) 2 + l^- 1 (A' 



!)\p = S p (\')=t*. 
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We wish to show that the map Uj with component-wise action 



[U2b«] 



F p \[(I-T*T)u + T*g] t ), if* e Z 
((I-T*T)u + T*g)., Hi el, 



(4.83) 



is a contraction. To this end, let v,v' be arbitrary vectors in £2(1) 



1. If the index % e X 1; then 



|[U2b(«)]i " [U2b(t/)]il = I [(/ - T*T){v - v%\; 



2. If the index i e X , then we split the analysis in two cases p > 1 and p — 1: 



(a) for p > 1, we have 



|[U Io (^-[U Io (^) 



F- 1 ^/ - T*T> + T*g]i) - F-\[(I - T*T)v' + T*g]i) \ 

d F p \o[(i-T*m 



[v — v 



< \[{I-T*T){v-v')l 



(4.84) 



where the second equality is an application of the mean value theorem, 
which is valid since F p 1 {\) is differentiate. The final inequality above 
follows from implicit differentiation of the relation 

F p -\F p (t))=t 



and the observation that \-^F p {t)\ > 1 (see the proof of Proposition 4.3.3); 
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(b) for p — 1, by analyzing all cases, we get also that 



\[Ui (v)]i-lVx (v%\ < \[(I-T*T)(v-v> 



')] 



(4.85) 



Together, we have 



l|Ux (^)-U Io K)||, 2 2(x) 



< 



\\{I-T*T)v-v'\\l (x) 



< \\I -T*T\\ 2 \\v -v 




< (l-5)\\v-V 




(4.86) 



As Uj is a contraction, we arrive at the stated result by application of the Banach 
Fixed Point Theorem. 

4.8.5 Proof of Lemma 4.5.4 

If i Eli, then Ui = Ui+ [T*(g—Tu)~\., which is satisfied if and only if [T*(g—Tu)]. = 
as stated. It remains to analyze the case i e X , and, again, we split the argument 
in the cases p > 1 and p — 1. 

1. First suppose p > 1. Using the notation A = Wj+ [T*(g — Tu)~\., the fixed point 
characterization (4.51) translates to 
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But of course A = F p {uj) is the unique value at which F p X (A) = u iy and so this 
implies that 

[T*{g — Tu)\. = Fp{ui) — Hi, (4.87) 

and, by reversing operations, the relation (4.87) in turn implies the fixed point 
condition (4.51). 

2. The case p — 1, which is similar, is left to the reader. 
4.8.6 Proof of Theorem 4.6.1 

The proof will be much simplified by the following lemma which characterizes vectors 
such as u that satisfy the fixed point relations (4.61) or (4.62): 

Lemma 4.8.4. If u and v are such that 

Jr urr (u + v,u) - \\v\\ 2 HI) > Jr urr (u,u) = J r p (u), (4.88) 
then J r p (u + v) > J r p (u). 

Proof. For any u and v, the following holds because ||L|| < 1: 

J?(u + v) = Jr urr (u + v,u) - \\Lv\\l {x) > Jr urr (u + v,u) - \\v\\l {1) . (4.89) 

If in addition u and v satisfy (4.88), then the desired result is achieved by virtue of 
the equality J r p ' surr (u, u) = J p (u). □ 

Let us show now the proof of Theorem 4.6.1. By Lemma 4.8.4, it suffices to show 
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that at a fixed point u defined by (4.61) or (4.62), any perturbation Sh G ^(X) with 
norm ||5/i||^ 2 (x) < min{[A'(r,p) - r], [r - H M (\')]} will satisfy 



JP,surr(- + 6Ku) _ JV^urr^ > \\ S h\\l {x) . ( 4 . 90 ) 

After expanding the left-hand-side above, the inequality (4.90) is seen to be equiva- 
lent to 

2 Shi[T*(Tu -g)]i + [™™{\ui + $h i \ p : r p }-mm{\u i \ p : r p }\ > 0. (4.91) 

At this point, it is convenient to consider the summation over i G I and i G Ii 
separately. 

By Lemma 4.5.1, the first summand above vanishes over X\ and 

1. if 1 <p, then £\ Gl 5^[T*(Tm - #)]; = - X^ e z ^ s g n ^f Kl 1 *" 1 ; 

2. ifp = l,then^. 6l 5/i i [T*(TiZ-^] i = -1/2^^ 8h i 8giiu i + , £ ie3? 5h i [T*(Tu- 
9)]i- 

With respect to the second summation, observe from Proposition 4.3.3 that for all 
1 < P, \ui\ > \'(r,p) > r for « G X 1; so that this summation vanishes over X x for 
any perturbation <5/i satisfying the component-wise inequality \Shi\ < A'(r,p) — r. 
Similarly, |-Uj| < H^^X') < r for i G X , so that for any perturbation 5/i satisfying 
component-wise \Shi\ < min{[A'(r,p) — r], [r — f/^^A')]}, we have that 

^ [min{|^ + <^,r p }-min{|^|V p ^ = ^ |w, + <5^| p - (4.92) 

194 



The desired result follows if we can show that 

1. 1 < p < 2: [\ui + 5hi\ p - \ui\ p - ShipisgnUiWuilP- 1 ] > 0, for all i E X 

2. p= 1: 

(a) |o7ij + tij | — — <5/ij[sgntij]] > for all i e Xq, and 

(b) 5hi[T*(Tu - g)]i + |o7i;| > 0, for all i E X a . 

The inequality in 2(6) follows directly from Lemma 4.5.1; by symmetry, 1 and 2(a) 
follow if, for any u > 0, 

min \f(v) := \u + v\ p — u p — pu p ~ 1 v 1 = min (u + f ) p — u p — pu p ~ 1 v > 0. (4.93) 

When p — 1, the right-hand-side is identically zero and the result holds. When 
1 < p < 2, differentiating the right-hand-side gives that f(v) has a local minimum 
at v — 0, at which /(0) = 0, and, at the endpoint, f(—u) — {p — l)ti p_1 > 0. 

4.8.7 Proof of Theorem 4.6.2 

Suppose that u* is a minimizer of the functional J p . Consider the partition of the 
index set X into X = {i E X : \u*\ < r} and X 1 = {i E X : \u*\ > r}, and note that 
< oo, or else \ J p {u*)\ would not be finite. As in the proof of Theorem (4.5.5), 
consider the unique decomposition u* = Uq + u\ into a vector u* supported on X 
and another u\ supported on X\. Again, let V : u — > tii and V L = X — V : u — > u 
denote the orthogonal projections onto the subspaces ^f x (X) and ^f°(X), respectively, 
and consider the operators T = TV 1 - and T\ = TV. 
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By minimality of u*, if we fix Uq, the vector u\ satisfies u\ = arg min^ g ^i 1 ^ J^ x [z\ 
where 

:= \\T 1 z-(g-T u*)\\l {J) + J2™H\zi\ P ,r p }. (4.94) 

ieXi 

Since all coefficients in u\ have absolute value > r, the vector u\ also minimizes 

the functional 

\\T lZ -(g-T Q ul)\\l {J) , ( 4 - 95 ) 

or, else, the vector z* minimizing (4.95) would satisfy Jr,i( z *) < <J7,i( u i)i contradict- 
ing the minimality of u\. In fact, u\ must be the unique vector minimizing (4.95). 
For, if another vector u' also minimized (4.95), then the operator 7\ would have a 
nontrivial null space containing the span of some nonzero vector v, so that all vec- 
tors in the affine space + tv : t G M} would be minimal solutions for (4.95). In 
this case, we would have also the freedom of choosing from this affine subspace a 
vector u' having one coefficient u\ satisfying |^| < r. But such a vector u' satisfies 
<^r,i( u ') < ^r,i( u i)^ contradicting the minimality of u\. 

It follows that the operator T\ must have trivial null space, and u\ is the unique 
minimal least squares solution to (4.95), well-known to be explicitly given by 

u\ = (T^y^ig-Toul), (4.96) 
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so that Tiu\ is the unique orthogonal projection of (g — T q Uq) onto the range of Ti. 
Actually V\ = Ti(Ti*Ti) _1 Ti* is the orthogonal projection onto the range of Ti, due 
to the non-triviality of the null space of Ti. Therefore we have Tiu[ =V\(g — Tqu* ). 
It easily follows that 

T*(T lU l-(g-T u* ))=0, (4.97) 

or, in other words, 

[T*(g - TV)] . = 0, for all i G X x . (4.98) 

Now, on the other hand, by observing that any optimal variable u\ for fixed uq 
depends on uq via the relationship u\ = (Tj"Ti) 1 T 1 *((? — T uo), we easily infer that 
the vector Uq minimizes 

J r p o(v) = \\rHnv-g)\\l {J) + J2^{\ v i\ p ^ ( 4 -") 

where Vi denotes the orthogonal projection operator onto the orthogonal comple- 
ment of the range of Ti. 



Consider the convex functional, 



Hv) ■= WV^Tov - g)\\l 2iJ) + |M|^ 0(i) , (4.100) 



and note that J^iu) < J- \u), while at the same time J7 r p (-Ug) = J-{u^) by virtue of 
the fact that \u*\ < r. For p > 1 it follows that Mq is also a minimizer of Fiu), and 
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so satisfies the Euler-Lagrange equations [7], 



{T*Vt(T u* ~g)) + l mnulluir 1 = 0, (4.101) 



which imply the fixed point conditions 



[T*(g - Tu% = | sgn (wS)ilK)il P_1 » for a11 * e X . (4.102) 



For p = 1 one uses results from [30] to conclude that 

ul = S 1/2 « + ^(f? - W), (4.103) 

where S 7 is the so-called soft-thresholding, defined component-wise § 7 (t>) = (S 1 (vi))i e i, 
where 

{0, |A|< 7 
(4.104) 
A-asA, |A|> 7 . 

(Actually, [30, Proposition 3.10] only states that any fixed point of (4.103) is a 
minimizer of (4.100); nevertheless the converse also holds, see [43, Remarks (1), pag. 
2515].) The fixed-point condition (4.103) implies 

\T*(g - Tu*)} . E [-1/2, 1/2], |<| < 1/2 

> l < — ( 4105 ) 

[T*( 5 - Tu*)}. = 1/2 sgn <, 1/2 < |<| < r. 

It remains to verify that 

• \u*\ > \'(r,p), if i E Xx, and 
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• \u*\ < F p 1 (\'(r,p)), for p > 1, and \u*\ < r — 1/4, for p — 1, if % G X . 

We show these conditions for p > 1 only, as the case p — 1 is proved with an 
analogous argument. 

1. We first show that > \'(r,p) if 2 G Zi- From the first part of the proof, we 
know that at a minimizer u*, the functional J p {u*) can be written as 

J*{v?) = WV^Toul - g)\\l (K) + + llxlr" (4.106) 

Note that at this point we make explicit use of the finite cardinality of X\. Fix 
i G Xi and any perturbation /i = hiti, hi G K, along the coordinate i (here, is 
the vector of the canonical basis). Consider the rank-one operator = TVi, 
where we use V% to denote the orthogonal projection onto the one-dimensional 
subspace spanned by e^. Observe that | tiu\\ = \(u)i\\\ti\\. Since ti is orthogonal 
to the argument Vi(T Uq — g) under the £2 penalty in (4.106), the minimality 
condition J p {u*) < J p {u* + h) can be written as 



\T^(T ^-g)\\l^ + H\\^ + ^ 



< II^TotiS-^ll^ + lltiSII^ 

+ \\Hif m + min{r p , \u* + htf} + r p (|Xi| - 1) (4.107) 

which is equivalent to the condition that 

r p < \\h i U\\l (x) +mm{r p ,\u* + h l \ p } (4.108) 
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hold for all hi G K. Now, since ||T|| < 1, it follows that ||^|| < 1, and (4.108) 
implies that 

r p < h 2 { +min{r p ,\u* + hi\ p } (4.109) 

holds for all hi G R, or, after the change of variables a = u* + h iy that 

r p < (a-u*) 2 + mm{r p ,\a\ p } (4.110) 

holds for all a G R. In particular, the inequality (4.110) must hold at the value 
a* that minimizes the right-hand-side. But we already know from Proposition 
4.3.3 that such a minimizer a* is of the form: 

| F-^u*), \u*\ < X'(r,p) 
[ «*, \u*\ > X'(r,p) 

Now, suppose \u*\ < X'(r,p). (We know that > r, so then r < \u*\ < 
\'(r,p)). From the proof of Proposition 4.3.3 we know that the function F _1 (A) 
is increasing, so then a* = F^iu*) < F~ 1 (X') < r. Since also S p is strictly 
increasing, it follows that S p (a*) < S P (F- 1 (X')) < S P (X') = r p . In the last 
inequality we used (4.80). (See also Table 1 for recalling the notations used 
here.) But this is a contradiction to the minimality condition, (4.110), and so 
we must conclude that > X'(r,p). 

2. We now show that < F~ 1 (X'(r,p)), if < r. Recall that for % G X , the 
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coefficient u* satisfies the fixed point condition, 



[T*{g-Tu*)]. = P -sgnu*\u*r\ (4.112) 

Fix % G Z , and consider as before any perturbation h = h^i along the coordi- 
nate i, hi G R. Let ti be the rank-one operator as defined before. Then, the 
minimality condition J p {u*) < J p (u* + h) is easily seen to be equivalent to 

\\Tu*-g\\l {K) + \u*\ p < \\Tu* - g + KUWliK) 

+ min{r p , \u* + hi\ p } 
= \\Tu* ~ g\\l {K ) + \\hiti\\l {1) + 2h t (t t ,Tu* - g) 

+ min{r p , \u* + hi\ p } 
= \\Tu* - g\\l {K) + \\hM\lix) -2h i ^sgnu*\u*r 1 

+ mm{r p ,\u* + hi\ p } (4.113) 

and the final equality follows directly from the fixed point condition (4.112). 
Now the chain of inequalities (4.113) implies the minimality condition 

KT < ||M l ||, 2 2( i)-2/ ii |sgn<|<r 1 + min{rM< + ^n 

< -2h i ^sgnu*\u*\ p ~ 1 + mm{r p ,\u* + hi\ p }, (4.114) 
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or, again using the change of variables a = u* + hi, the inequality 



KT <(a- u*) 2 - 2(a - u*)^ sgau^u*^ 1 + min{r p , \a\ p }. (4.115) 



Again, the inequality (4.115) should hold for all a by the minimality of u* . 
Minimizers a* of the right-hand-side of (4.115) also are minimizers of 

(a - (u* + V - sgn^Kr 1 )) 2 + min{r p , (4.116) 

which we know to have the form 

, F^K + fsgn^Kr 1 ), \ u *\ + E\u*\^<X'(r,p) 
a = \ (4.117) 

u* + lsgnu*\u*\P-\ \u*\ + IKr 1 > X'(r,p) 

But u* + | sgxiu*\u*\ p ~ l = F p (u*), so the above reduces to 



, u*, F p (u*) < \'(r,p) 
a* = { ~ (4.118) 

F p (u*), F p (u*)>X'(r,p) 



As before, the proof proceeds by contradiction. Suppose that F p (u*) > X'(r,p), 
so that a* = F p (u*) > X'(r,p) and S p (a*) > S P (X') = r p . Note that, by 
recalling F p (u*) = u* — \ sgn(u*)(u*) p ~ 1 , we have 

S p (a*) = (u* - F p (u*)) 2 + \u*\ p = \u*\ p + ^\u*\ 2p - 2 . (4.119) 

Plugging a* into the right-hand-side of (4.115), noting that A'(r, p) > r so that 
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I a* I >r, and rearranging, yields the inequality 

KT < r p - j\u*\ 2p ' 2 or S p (a*) = \u*\ p + ^|m*| 2p " 2 < r p . (4.120) 

But this contradicts the assumption that the expression in (4.119) be larger 
than r p . 
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